The question is: are there any articles, blogs, or other resources on how to massively debug a grammar written "theoretically"? The difficulty is that in Russian many sentences make sense, but I found that sometimes it's only a coincidence. My guess is I need some reliable (and maybe hand-checked?) treebanks. Any ideas? The main point is syntax; the lexicon is pretty straightforward to debug.
In some cases I failed to understand what meaning the RG is supposed to convey. The first such example is few_Det. I checked English and Finnish, but I am still not quite sure. Or take CleftAdv (here Finnish helped: it gives the almost forbidden "se on Porissa kun Matti asuu", but that still does not increase confidence, as it's in the grey area of my language intuition). As a dev suggestion, there could be some "semantically universal" description of each construct, so that developers can check the results in their own language. Maybe some semantic pseudo-language could explain things in simple English or something.
Also, I have not clearly understood the division between the Morpho and Res modules. In what I did, ResRus.gf is bloated while MorphoRus is very small, plus I have a separate module for a special kind of Morpho. Any advice here?
One more question: is it possible to have embedInCommas at the beginning of a sentence as well? At the moment it does not "sense" being the first thing, only the punctuation does. Maybe it's trivial to add to the RGL, or is a new term for "beginning" needed?
Then there is also the question of what the acceptance criteria are for inclusion in the official RGL. Now that there already is a Russian grammar, what is the upgrade path? Will it be "Russian2", or will it replace the old "Russian"? I am aiming at total replacement, but it will not be backwards compatible, because I streamlined things and some obscure cases should be handled differently (in what I see as a much less hackish way). I see something like exper under Fre/Romance, but in the Russian case the two grammars are very different, and I see no point in adapting one to the other.
1. It seems applications use ResRus internal functions a lot, so I added more functions. However, regV can't be reliably redone, because the old grammar used a suboptimal set of 5 forms (the new grammar uses at most three forms for verbs: the infinitive, SgP1 and SgP3, and that's enough for most verbs). The old grammar does not have SgP3, so there is no good way to emulate the old regV... The new grammar has a built-in conjugation checker: with it I found 2-3 conjugation mistakes in my project.
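For illustration, a call to the new regV might look like the sketch below. I am assuming the argument order matches the list above (infinitive, SgP1, SgP3); the verb is just an example:

```gf
-- Sketch: three principal parts are enough to derive the rest of the
-- conjugation for most verbs, e.g. "спать" (to sleep):
-- infinitive "спать", 1sg present "сплю", 3sg present "спит"
oper sleepV : V = regV "спать" "сплю" "спит" ;
```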
3. I have a hard time understanding why something as simple as "I eat an apple" is parsed into a monster like:

    UseCl (TTAnt TPast ASimul) PPos (PredVP (AdvNP (UsePron i_Pron) (PrepNP part_Prep (DetNP (DetQuant DefArt NumSg))))
    (AdvVP (ComplSlash (AdvVPSlash (VPSlashPrep (ComplSlash (SlashV2a eat_V2) (DetNP (DetQuant DefArt NumSg))) part_Prep)
    (PrepNP part_Prep (DetNP (DetQuant DefArt NumSg))))
    (AdvNP (DetNP (DetQuant DefArt NumSg)) (PrepNP part_Prep (DetNP (DetQuant DefArt NumSg)))))
    (PrepNP part_Prep (AdvNP (MassNP (ApposCN (UseN apple_N) (AdvNP (DetNP (DetQuant IndefArt NumPl))
    (PrepNP possess_Prep (DetNP (DetQuant DefArt NumPl))))))
    (PrepNP possess_Prep (DetNP (DetQuant DefArt NumSg)))))))

It seems to me that making adverbs from everything gives too much freedom and muddies everything.
But when I get a hint from Finnish, I can find the short way (probably even the shortest):

    AllRusAbs> l UseCl (TTAnt TPast ASimul) PPos (PredVP (UsePron i_Pron) (ComplSlash (SlashV2a eat_V2) (MassNP (UseN apple_N))))
    я кушал яблоко

I am wondering how to deal with parsing. Maybe I should remove some functions from parsing.
Hi,

> 1. Seems like applications use ResRus internal functions a lot, so I added more functions, however, regV can't be reliably redone because old grammar used suboptimal set of 5 forms (new grammar uses max three for verbs - inf, SgP1, SgP3, and it's enough for most of verbs). Old grammar does not have SgP3, so there is no good way to emulate old regV... New grammar has built-in conjugation checker: I found 2-3 conjugation mistakes in my project with it.

That's unfortunate. Which applications are you talking about?
If you want to be kind to the old applications, you could leave the internal opers with the same names and type signatures, but just not use all the arguments. If the old oper is called regV and takes 5 forms a,b,c,d,e, and your new version takes some arguments a,b,f, then you could do something along these lines:

    goodRegV : (a,b,f : Str) -> Verb = … ; -- your new implementation

    regV : (a,b,c,d,e : Str) -> Verb = \a,b,c,d,e ->
      let f = guessF a b … ;
      in goodRegV a b f ;
> 3. I have hard time to understand why something as simple as "I eat an apple" is parsed into a monster like:
>
>     UseCl (TTAnt TPast ASimul) PPos (PredVP (AdvNP (UsePron i_Pron) (PrepNP part_Prep (DetNP (DetQuant DefArt NumSg))))
>     (AdvVP (ComplSlash (AdvVPSlash (VPSlashPrep (ComplSlash (SlashV2a eat_V2) (DetNP (DetQuant DefArt NumSg))) part_Prep)
>     (PrepNP part_Prep (DetNP (DetQuant DefArt NumSg))))
>     (AdvNP (DetNP (DetQuant DefArt NumSg)) (PrepNP part_Prep (DetNP (DetQuant DefArt NumSg)))))
>     (PrepNP part_Prep (AdvNP (MassNP (ApposCN (UseN apple_N) (AdvNP (DetNP (DetQuant IndefArt NumPl))
>     (PrepNP possess_Prep (DetNP (DetQuant DefArt NumPl))))))
>     (PrepNP possess_Prep (DetNP (DetQuant DefArt NumSg)))))))
>
> It seems to me that making adverbs from everything gives too much freedom and muddies everything.

Allowing empty NPs is the root of the problem here, rather than adverbs. Since Russian doesn't have articles, it's understandable that DetNP for DefArt and IndefArt is just an empty string.

DetNP makes sense for determiners like "this", "that" or "mine", "yours". It makes no sense for most other dets. Some RGs try to follow the principle "let's try to make most of the RGL trees make sense", and linearise DetNP some_Det into "something", DetNP (DetQuant IndefArt NumSg) into "one", DetNP (DetQuant DefArt NumSg) into "this", and so on. Other languages follow the principle "as long as there is some way in the RGL to say what I want, I don't care if some other trees linearise into nonsense".

Now, due to the ambiguity problems, I would recommend including some non-empty string in all Dets. It should be separate from the s field, because you don't want "an apple" to be translated into "одно яблоко", but just "яблоко". But when (DetQuant IndefArt NumSg) is given to DetNP, the non-empty string in the other field should be chosen. Same treatment for the definite article. The downside of picking words like "one" and "this" is that DetNP (DetQuant this_Quant NumSg) would become ambiguous. If you can find a word that makes some sort of sense and doesn't introduce ambiguity, that'd be an improvement. But introducing any string at all, even "this" and "one", is better than getting such monstrous parses as you are getting now.
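To make the suggestion concrete, here is a minimal sketch in GF. The record shape and field names are purely illustrative (a real Russian Det also needs case and gender tables), but it shows the idea of a second, never-empty field that only DetNP consults:

```gf
-- Sketch only: agreement is ignored, names are illustrative.
lincat Det = {s : Str ; sp : Str ; n : Number} ;

lin
  -- s is what precedes a noun: empty for Russian "articles".
  -- sp is a standalone form that is never empty.
  DetQuant quant num =
    {s = quant.s ++ num.s ; sp = quant.sp ++ num.s ; n = num.n} ;

  -- "an apple" uses s, so still just "яблоко";
  -- DetNP (DetQuant IndefArt NumSg) uses sp, e.g. "одно".
  DetNP det = {s = det.sp} ;
```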
> But then when I get hint from Finnish I can find the short way (probably even shortest):
>
>     AllRusAbs> l UseCl (TTAnt TPast ASimul) PPos (PredVP (UsePron i_Pron) (ComplSlash (SlashV2a eat_V2) (MassNP (UseN apple_N))))
>     я кушал яблоко
>
> I am wondering how to deal with parsing. Maybe, I should remove some functions from parsing.

The most important thing is that determiners have some non-empty string that DetNP can choose. That should get rid of most of the problems.

But some other constructions are also very overgenerating, like ApposCN. For internal purposes only, I sometimes insert a dummy string when I don't want ApposCN to pollute my parses. Like this:

    ApposCN cn np = cn ** {s = \\n,cas => cn.s ! n ! cas ++ "_" ++ np.s ! cas} ;

A concrete use case: I'm writing an application grammar, and I want to find some construction that I know is in ExtendEng, but I don't remember what it is. I parse a sentence in AllEng, and I get tons of garbage. I can't comment out the overgenerating functions, because they are used by the API, so instead I add some string to their linearisations so that they won't show up in my parses. If I do want a tree with a function I normally don't want, I can just insert that character, like "city _ Paris".
Inari
    --2 Definitions of paradigms
    --
    -- The definitions should not bother the user of the API. So they are
    -- hidden from the document.
You received this message because you are subscribed to the Google Groups "Grammatical Framework" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gf-dev+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gf-dev/b985c2dc-ec44-4fc9-9114-299d0c1ca860o%40googlegroups.com.
One more problem is with the word "like", which the new grammar defines like this:

    like_V2 = mkV2 (mkV imperfective intransitive "нравиться" "нравлюсь" "нравится") Dat ;

Note that this verb does not have the non-reflexive form that the old grammar tries to use to mimic the English "like". So in the new grammar the argument structure of the verb is the opposite.
Lang> p "I like grammars" | l
I like grammars
yo gusto gramáticas -- should be "me gustan gramáticas"
This is not a massive problem per se; the application grammarian just needs to know this is the case. If your application grammar has a function like the following:
fun Like : NP -> NP -> Cl ;
then you just need to linearise it as follows:
lin Like subj obj = mkCl obj like_V2 subj ;
Maybe, due to the importance of "X likes Y", there could be a common abstract constructor that hides the implementation details, as there is with have_name_Cl, and even the less important cup_of_CN.
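As a sketch, such a constructor could be an API-level oper whose per-language definition hides which argument becomes the grammatical subject. The name like_Cl is hypothetical:

```gf
-- Hypothetical shared interface:
oper like_Cl : NP -> NP -> Cl ;  -- experiencer, stimulus

-- English-style definition: the experiencer is the subject.
oper like_Cl = \exp,stim -> mkCl exp like_V2 stim ;

-- Russian (нравиться) / Spanish (gustar): the stimulus is the subject.
oper like_Cl = \exp,stim -> mkCl stim like_V2 exp ;
```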
Now that I have compared the old and new grammars, the new one seems to be way more accurate (even though there could still be some bugs in it). One problematic point is that infinitive sentences (via SC?) are not that advanced, even for English. For example, this does not work:

    p -cat=Text "how to talk ?"
and a phrase like

    p -cat=Text "to live is to fly ."

is supported in English via some obscure extension (InOrderToVP),
[…]
So now I am thinking about how to simplify the forming of infinitive sentences in Russian without adding new tenses, or how to introduce a new mood for that, or whether something useful can be found in Extend.
even though SC promises to serve as both subject and object.
I also do not quite understand what a "well-formed" grammar should have at minimum. For example, for documentation: Finnish uses a functor for that; the old Russian does not. I may need help with those once the new grammar is ready.
| sg. | without article | котка |
| --- | --- | --- |
| | with article | котката |
| | full article | котката |
| pl. | without article | котки |
| | with article | котките |
| vocative form | котко | |
| count form | котки | |
It seems I still do not fully understand Extend.
I used to have AllRus.gf like this:

    concrete AllRus of AllRusAbs = LangRus, ExtraRus ** open ExtendRus in {flags coding=utf8;}

but then ExtendRus was not available at all (by that I mean that e.g. iFem_Pron could not be used in the shell).
    abstract AllRusAbs = Lang, ExtraRusAbs ;
    concrete AllRus = LangRus, ExtraRus ** open ExtendRus

    abstract AllRusAbs = Lang, ExtraRusAbs ;
    concrete AllRus = LangRus, ExtendRus

    abstract AllRusAbs = Lang, ExtraRusAbs, Extend ;
    concrete AllRus = LangRus, ExtraRus, ExtendRus ;
But for example in Finnish I see:

    concrete AllFin of AllFinAbs =
      LangFin - [SlashV2VNP,SlashVV, TFut], ---- to speed up linking; to remove spurious parses
      ExtraFin - [ProDrop, ProDropPoss, S_OSV, S_VSO, S_ASV, AdvExistNP] -- to exclude spurious parses
      ** open ExtendFin in {} --- to make it compile by default

I have not analysed it further, but there certainly are reasons why it was done this way. "To make it compile by default" does not bring more understanding. Fortunately, the new Russian RG is quite light (or so it feels when loading it, compared to Finnish), thanks to the Basque-style glueing approach and the use of records.
I am thinking about how to introduce Russian-specific word-order variations, a part of speech (the transgressive), wider possibilities for impersonal and infinitive VPs (Russian is really rich in those), and some minor tweaks to prepositions varying according to "pre" (this can partly be left to applications, because the rules are sometimes semantic). To my mind, the current abstract RG does not give enough room for those VPs, or is too English-centric, but I have not yet checked every Extend entry.
One specific problem I have is the "impolite" or "order" imperative, which grammatically is literally just the infinitive: "to do something". It is the main mode used in computer interfaces (in addition to dogs and other "immediate imperative" situations). I can bind it to Imp Sg P1 (the nearest thing available when someone gives an order to himself/herself/itself), but I have not found an API path for that yet.
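For now, this could be approximated with a small language-specific oper. Everything below is a sketch: the oper name is made up, and inf and compl stand for whatever fields the Russian VP record actually uses for the infinitive and the complement:

```gf
-- Sketch of an "order" imperative: linearise a VP as a bare
-- infinitive, e.g. "сохранить файл" ("save the file").
oper orderImp : VP -> Utt = \vp -> lin Utt {s = vp.inf ++ vp.compl} ;
```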
    % grep PartNP Constructors.gf
    = PPartNP ; --%
    concrete MyGrammarXxx of GrammarXxx = Foo, Bar **
      open SyntaxXxx, ParadigmsXxx, (N=NounXxx) in {
        lin MyFunction a b c d = … (N.PartNP a (mkNP b) … ;
      }
I am not sure how to use gftest for the main test suite, as it gets stuck at ListNP for a long time (is it supposed to run for hours?), but the "lexicon-only" cases look good.
Now DocumentationRusFunctor is also nearing completion. Another question here is whether it will be visible on the web, or whether it is just a resource used (how?) in applications.
There are great things above for documentation! Maybe this correspondence can make a good blog entry.
1. Is there any better mechanism to see how the API maps to low-level functions?
2. Sometimes parsing helps, but the new Russian grammar now has so many extensions that parsing sometimes does not work (actually, it's no different from the old one in this respect). Maybe there is some possibility for parsing to switch from depth-first to breadth-first search? (I don't know what is behind the search algorithm, but for some reason very obscure, long trees with piled-up adverbs always come up first. Sometimes I wait till the end, and the useful, shortest ones are there.)
Another unexpected use case for ma is if your grammar is extremely ambiguous. Say you have a few empty strings in strategic places, or are using a large lexicon with tons of synonyms: you'll easily get hundreds or even thousands of parses for a short sentence. Example:
Lang> p "작은 고양이가 좋습니다" | ? wc
359 11232 86068
Lang> ma "작은 고양이가 좋습니다"
작은
small_A : s (VAttr Pos)
short_A : s (VAttr Pos)
고양이가
cat_N : s Subject
좋습니다
good_A : s (VF Formal Pos)
This is much more readable than 359 trees. The subject is a small or short cat, and the predicate is that the cat is good. Just by seeing the morphological parameters from the inflection tables, we can infer that small is attributive and good is predicative.
4. Symbol. Have I understood correctly that it's not included in AllXXX and applications need to import it explicitly? The problem is I do not know how to test it with something meaningful. Luckily, it's quite simple.
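If I remember the Symbol API correctly (MkSymb and SymbPN; treat that as an assumption), a minimal meaningful test is to use a literal string as a proper name, importing Symbol alongside Lang and linearising a tree like:

```gf
-- Assumes Symbol's MkSymb : String -> Symb and SymbPN : Symb -> PN.
-- "x" is used as a proper-name NP with a Lexicon verb:
PredVP (UsePN (SymbPN (MkSymb "x"))) (UseV walk_V)
```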