On Russian RG-2.0 early development


Roman Suzi

Jun 22, 2020, 12:39:12 AM
to Grammatical Framework
Hi!

This time I have some code for the next generation Russian RGL to show: https://github.com/rnd0101/gf-rgl-russian2

There is no support for anything but noun paradigm descriptions at the moment, and it's not totally bug-free, but it is based on a mainstream approach, the so-called "Zalizniak algorithm". (I am using a subset of the Zalizniak index, as it covers a significant portion of Russian words. The algorithm itself covers 100%.) Indices can easily be found on Wiktionary; individual words can be looked up e.g. here: http://gramdict.ru/

GF has proved to be a good tool for the job so far, except for some small nuisances:
1. It remains unclear to me why records sometimes depend on the order of their constituents (causing extra effort when describing "worst case" words: https://github.com/rnd0101/gf-rgl-russian2/blob/master/src/russian2/ParadigmsRus.gf#L63 )
2. Sometimes an explicit type needs to be given where I'd have expected the type system to deduce it (I do not remember where it was, in the middle of some function)
3. I'm not sure why it's not possible to take a record directly in the concrete grammar, so a semi-dummy workaround with an extra function was needed, like here: https://github.com/rnd0101/gf-rgl-russian2/blob/master/src/russian2/LexiconRus.gf#L60

Overall, I am surprised myself at how far I was able to get with this. In my opinion it's a very clear improvement over what is now in the Russian RG on all fronts. I used the frequencies of declension styles to decide what to focus on, not the Swadesh list, so a Russian word picked uniformly at random will likely work automatically. However, the list of the most frequent Russian nouns is not 100% covered by the automatic guessing, which IMHO is also fine, as those words can go in DictRus2 and be fine-tuned manually. (The old DictRus contains so many errors that I've just removed it.)

I hope to do more. The main idea is that paradigms should have a more flexible connection with the rest. Even with Zalizniak it's clear that some Russian nouns have the declension of adjectives (eg http://gramdict.ru/search/%D0%BF%D1%80%D0%BE%D0%B8%D0%B7%D0%B2%D0%BE%D0%B4%D0%BD%D0%B0%D1%8F ), so the lexicon description references an adjective-style declension type. The same goes for some pronouns, etc. So at the bottom level, where word forms are made, different abstractions are used, not parts of speech. (Well, they mostly follow parts of speech, but not 100%.) In other words, in Russian it may well be that the declension follows a different gender than the one the word is considered to have, so producing forms should be somewhat disconnected anyway.

Please comment if something looks terribly wrong. (Noun linearization is temporarily set to show all forms, for debugging.) For example, I decided to include the so-called minor cases explicitly instead of having the special-purpose workarounds of the old RG, e.g.:

forest_N = (mkNplus (mkN "лес" masculine inanimate (ZC 1 No C ZC1))) ** {sloc="лесу"} ;

illustrates the approach for what was previously quite an irregular word. While this kind of description looks complex, it's not complicated and follows straightforwardly from what the dictionary entry gives. (The ZC 1 No C ZC1 part is just an encoding of the index 1c(1), and the locative minor case is also mentioned in the dictionary entry.)
The only ugly part is that mkNplus, which I am not sure how to omit, but maybe it's even better this way. Many words can still be guessed with just mkN "слово", some more by adding gender/animacy. The "worst case" can be given with a record. I have not yet encountered words that would require it.

With best regards,
Roman
PS. this is my personal project, not related to work

John J. Camilleri

Jun 24, 2020, 5:50:43 AM
to gf-...@googlegroups.com
Hi Roman,

Good to see you are making good progress! When can we expect the pull request for the new Russian RG? 😉

The requirement for mkNplus (whose implementation is the identity) is indeed strange and I don't quite know why it's needed. Other things which work:

feather_N = let n: N = (mkN "перо" neuter inanimate (Z 1 No D)) in n ** {pnom="перья"};
feather_N = <mkN "перо" neuter inanimate (Z 1 No D):N> ** {pnom="перья"};

In fact it seems to be a problem with the overload, because this also works:

oper
  mkkN : Str -> Gender -> Animacy -> ZIndex -> N
    = \word, g, a, z -> lin N (noMinorCases (makeNoun word g a z)) ;

lin
  feather_N = mkkN "перо" neuter inanimate (Z 1 No D) ** {pnom="перья"};

Indeed this is why it works in Czech. Maybe someone else can shed more light on this.

I'm not sure what you mean about the order of constituents mattering in record extension. The following expressions give me the exact same thing:

  feather_N = mkkN "перо" neuter inanimate (Z 1 No D) ** {pgen="YYY";pnom="XXX"};
  fingernail_N = mkkN "перо" neuter inanimate (Z 1 No D) ** {pnom="XXX";pgen="YYY"};

John

--

---
You received this message because you are subscribed to the Google Groups "Grammatical Framework" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gf-dev+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gf-dev/94fcaf3a-5f51-431a-b9d5-4b72046ab0fbo%40googlegroups.com.

Roman Suzi

Jun 24, 2020, 4:19:52 PM
to gf-...@googlegroups.com
Hi John,

Regarding workarounds: Yes, I guess those will work, thanks. The problem with the workarounds for feather_N is that they make the lexicon description more verbose for no obvious reason.

As for the problems with the order, they are not about changing 2-3 cases afterwards, as in the case of forest_N (that simply works as expected), but about describing the worst case, i.e. all cases, using this overload:

mkN : NounForms -> N
= \nf -> lin N nf ;

This one seems to work only when the order is exactly the same as in NounForms. I am not quite sure why there is such a limitation. I suspect type-checking oddities, as {pgen="YYY";pnom="XXX"} and {pnom="XXX";pgen="YYY"} should be of the same type.
Records with explicit "keys" are much more readable and ergonomic than just a sequence of strings, because the "context" (a case) is explicit, unlike in the "worst case" mkN of the old Russian RG. One way, of course, is to use the mkN : Str -> Gender -> Animacy -> N function and then just overwrite all cases. So I will probably just remove that mkN : NounForms -> N ...

With best regards,
Roman




John J. Camilleri

Jun 25, 2020, 7:21:54 AM
to gf-...@googlegroups.com
On Wed, 24 Jun 2020 at 22:20, Roman Suzi <roman...@gmail.com> wrote:
Hi John,

Regarding workarounds: Yes, I guess those will work, thanks. The problem with workarounds for feather_N is that it makes lexicon description more verbose for no obvious reason.

Well I wouldn't say they are workarounds, or that they are any better than your solution — I was just providing some additional information to help find the root of the problem. Something to do with type inference of overloads, to put it briefly, which feels like it could be fixable in GF core. We should create an issue for this.
 
As for problems with the order, they are not about changing 2-3 cases afterwards as in case of forest_N (it simply works as expected), but to describe the worst case: All cases, using this overload:

mkN : NounForms -> N
= \nf -> lin N nf ;

This one seems to work only when the order is exactly the same as in NounForms. I am not quite sure why there is such a limitation. I suspect type checking oddities as {pgen="YYY";pnom="XXX"} and {pnom="XXX";pgen="YYY"} should be of the same type.

Can you give a concrete example of this, together with what you expect and what you get? The best place for this is probably also an issue on GitHub.
 

Roman Suzi

Jun 28, 2020, 7:37:02 AM
to Grammatical Framework
Hi!

I am having difficulty making a minimal test example for the feather_N / mkkN
trouble described above. Maybe you can describe it more informatively for the gf-core developers?

As for

mkN : NounForms -> N
= \nf -> lin N nf ;

the example is when I try to make a noun by just giving a full record to that function, where the order of the elements is different from the order in the type definition.

In ResRus I have:

NounForms : Type = {
snom, sgen, sdat, sacc, sins, sprep, sloc, sptv, svoc,
pnom, pgen, pdat, pacc, pins, pprep : Str ;
g : Gender ;
a : Animacy
} ;


In ParadigmsRus :

mkN = overload {
  mkN : (nom : Str) -> N
    = \nom -> lin N (guessNounForms nom) ;

  mkN : Str -> Gender -> Animacy -> N
    = \nom, g, a -> lin N (guessLessNounForms nom g a) ;

  mkN : Str -> Gender -> Animacy -> ZIndex -> N
    = \word, g, a, z -> lin N (noMinorCases (makeNoun word g a z)) ;

  mkN : Str -> Gender -> Animacy -> Str -> N
    = \word, g, a, zi -> lin N (noMinorCases (makeNoun word g a (parseIndex zi))) ;

  mkN : NounForms -> N
    = \nf -> lin N nf ;
} ;

in lexicon (artificial example):

lin
animal_N = mkN {
snom="с"; sgen="с"; sdat="с"; sacc="с"; sins="с"; sprep="с"; sloc="с"; sptv="с"; svoc="с";
pnom="с"; pgen="с"; pdat="с"; pacc="с"; pins="с"; pprep="с";
g=Masc;
a=Inanimate
} ;


The error message:

LexiconRus.gf:
   LexiconRus.gf:6-11:
     Happened in linearization of animal_N
      no overload instance of mkN
      for
        {snom : Str; sgen : Str; sdat : Str; sacc : Str; sins : Str;
         sprep : Str; sloc : Str; sptv : Str; svoc : Str; pnom : Str;
         pgen : Str; pdat : Str; pacc : Str; pins : Str; pprep : Str;
         g : Gender; a : Animacy}
      among
        Str
        Str Gender Animacy
        Str Gender Animacy ZIndex
        Str Gender Animacy Str
        {a : Animacy; g : Gender; pacc : Str; pdat : Str; pgen : Str;
         pins : Str; pnom : Str; pprep : Str; sacc : Str; sdat : Str;
         sgen : Str; sins : Str; sloc : Str; snom : Str; sprep : Str;
         sptv : Str; svoc : Str}
      with value type {a : Animacy; g : Gender; pacc : Str; pdat : Str;
                       pgen : Str; pins : Str; pnom : Str; pprep : Str; sacc : Str;
                       sdat : Str; sgen : Str; sins : Str; sloc : Str; snom : Str;
                       sprep : Str; sptv : Str; svoc : Str}

Maybe I am missing something like lock_N, but I do not really see whether it can even work like that, though that would be the most natural way to give the worst case with records.

With best regards,
Roman



Inari Listenmaa

Jun 28, 2020, 9:41:16 AM
to gf-...@googlegroups.com
This looks definitely strange! 

Just a thought, could you try to annotate the nf argument like this?

mkN : NounForms -> N = \nf -> lin N <nf:NounForms> ;

I made a standalone example of the problem, all in one file: https://gist.github.com/inariksit/e6e69b324f6cf2051f35cafe910ea363 The <nf:NounForms> annotation didn't help there, but maybe there are other things in play, e.g. my file is just a resource, with no lins anywhere.

Inari


Roman Suzi

Jun 28, 2020, 10:13:10 AM
to Grammatical Framework
Hi Inari,

It does not work when I make that change in my larger setting. I've noticed that the order of the record items looks random; maybe this somehow prevents the linearization from being found?

This is what I get when I move Animacy and Gender first (I had gender first, animacy second):

LexiconRus.gf:
   LexiconRus.gf:6-12:
     Happened in linearization of animal_N
      no overload instance of mkN
      for
        Str {g : Gender; a : Animacy; snom : Str; sgen : Str; sdat : Str;
             sacc : Str; sins : Str; sprep : Str; sloc : Str; sptv : Str;
             svoc : Str; pnom : Str; pgen : Str; pdat : Str; pacc : Str;
             pins : Str; pprep : Str}
      among
        Str
        Str Gender Animacy
        Str Gender Animacy ZIndex
        Str Gender Animacy Str
        Str {a : Animacy; g : Gender; pacc : Str; pdat : Str; pgen : Str;
             pins : Str; pnom : Str; pprep : Str; sacc : Str; sdat : Str;
             sgen : Str; sins : Str; sloc : Str; snom : Str; sprep : Str;
             sptv : Str; svoc : Str}
      with value type {a : Animacy; g : Gender; pacc : Str; pdat : Str;
                       pgen : Str; pins : Str; pnom : Str; pprep : Str; sacc : Str;
                       sdat : Str; sgen : Str; sins : Str; sloc : Str; snom : Str;
                       sprep : Str; sptv : Str; svoc : Str}


I got it to work only when I matched the funny order that the error message suggested (ignore the first Str argument for a moment; it will be needed below):

animal_N = mkN "пример" {
a=Inanimate;
g=Masc;
pacc="с";
pdat="с";
pgen="с";
pins="с";
pnom="с";
pprep="с" ;
sacc="с";
sdat="с";
sgen="с";
sins="с";
sloc="с";
snom="с";
sprep="с";
sptv="с";
svoc="с";
};


So, as I originally suggested, there is possibly some glitch in the type checking, which tries to compare records using some hash-table-induced order (a wild guess: I have not looked at the implementation).

The glitch does not go away even if I add a dummy function:

dummyNounForms : Str -> NounForms -> NounForms = \s, nf -> guessNounForms s ;

mkN : Str -> NounForms -> N
= \s, nf -> lin N (dummyNounForms s nf) ;

-- this also does not work without order matching.


(Also, I noticed that your example has an explicit lock_N and mine does not. But that has no influence on the problem at hand.)

-Roman


Roman Suzi

Jun 28, 2020, 10:38:09 AM
to Grammatical Framework
Or actually, it is the alphabetical order of the record keys that is, for some reason, required.

-Roman

Roman Suzi

Jun 28, 2020, 12:14:29 PM
to Grammatical Framework
hi!

One more hint. The **-operation "normalizes" the record, so the following starts to work:

(<> ** {... any order here ...})

This means, that record literals are not "normalized" (?) by default.
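
A minimal sketch of this workaround as a standalone resource module. The module name OrderDemo, the two-field type Forms and the operation mk are hypothetical stand-ins for NounForms and mkN, and the behaviour described in the comments is only what is reported in this thread, not something checked against every GF version:

```gf
resource OrderDemo = {
  param
    Gender = Masc | Fem ;

  oper
    -- Hypothetical small stand-in for NounForms; fields declared as s, then g.
    Forms : Type = { s : Str ; g : Gender } ;

    mk = overload {
      mk : Str -> Forms = \str -> { s = str ; g = Masc } ;
      mk : Forms -> Forms = \f -> f ;
    } ;

    -- Record literal passed directly to the overload. Per the reports
    -- above, the fields may need to be in alphabetical order (g before s)
    -- for the overload instance to be found.
    direct : Forms = mk { g = Masc ; s = "x" } ;

    -- Extending the empty record first: the fields are given in
    -- declaration order here, which is not alphabetical.
    extended : Forms = mk (<> ** { s = "x" ; g = Masc }) ;
}
```

If the observation holds, swapping the two fields in `direct` would be expected to trigger the "no overload instance" error, while `extended` should be accepted with either order.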

-Roman

Inari Listenmaa

Jun 28, 2020, 1:25:40 PM
to Grammatical Framework
Oh, that's interesting! I'm learning new things about GF after using it for 10 years! :-D So does it work to just put <> ** nf in your mkN?

As for my explicit lock_N, those should not be used; "lin N" is preferred instead. (You see explicit lock_X fields in old resource grammars, but that's because they are old.) I just put a lock_N field in my type N because I wanted to keep things simple and have fewer things to write, that is, only opers and no lins. So technically an oper with lock_N is roughly as if that oper (without the lock_N field) were on the RHS of a lincat.

Slightly off-topic, but I have used tables for similar purposes before, and they don't have the same problem with ordering. Here's an example of giving a table as an argument to a constructor.

Flt = binop "<" (table {
   HWeight => "lighter than" ;
   HHeight|HLength => "shorter than" ;
   HAmountCount => "fewer than" ;
   HAmountMass => "less than" ;
   HSize => "smaller than" ;
   HSpeed => "slower than" ;
        _ => "less than"}) ;

Though maybe records are safer for a resource grammar: you can't accidentally forget something and put in a wildcard. Tables are also a bit more verbose, but that's a minor detail.

Inari

John J. Camilleri

Jul 1, 2020, 5:31:23 AM
to gf-...@googlegroups.com
I've created two GitHub issues for these problems, where I've tried to find minimum working examples:

Type inference fails with overload and record extension

Order of record fields significant when using overload

In both cases it really seems that the use of overload causes the problem...


Roman Suzi

Jul 3, 2020, 4:22:18 PM
to Grammatical Framework
Thanks a lot! As records are quite convenient constructs it could be good to use them without fancy workarounds.
- Roman

Aarne Ranta

Jul 6, 2020, 12:03:16 PM
to gf-...@googlegroups.com
Roman, John,

Thanks for raising the issue. These are bugs, not features, and it is great that they have been identified after so many years.

An intermediate report: 

- The order of fields was easy to solve and doesn't seem to have caused trouble anywhere. Pushed to Master in GitHub. Let me know if it causes problems.
- Type inference with record extension is more difficult. Not yet solved. One idea that looks like the thing to do in GF/Compile/TypeCheck/RConcrete.hs lines 522-523 does solve John's minimal example, but it breaks some RGL code for reasons that I don't quite understand. Pushed but commented out.

It is both fascinating and embarrassing to go back to code that I wrote mostly in 2006, with a few later patches here and there.

  Aarne.









Aarne Ranta

Jul 7, 2020, 3:18:28 AM
to gf-...@googlegroups.com
I should of course add that the system has a complexity that I don't know from any other programming language. It combines

- overloading based on static typing
   - also allowing disambiguation on the value type, which is more than usual
   - also allowing partial application of functions
- subtyping, in particular records
- type inference
  - but not for lambda expressions at the moment: this is the issue I met yesterday, and which appeared in a nested overloading in ParadigmsAra
  - the reason it does not work for lambda is that GF is not polymorphic: e.g. (\x -> x) can only be type-inferred in a polymorphic system

Unlike the "basic GF" type system (in the JFP article 2004), this system has not been formally specified, but grown organically by addition of things that turned out useful in practice. Hence it can be the case that a complete and sound system combining all these features is not even possible: some formal work is needed to verify this.

For the time being, I cannot hence promise a solution of issue #66 that would not break anything else (in the legacy code). Some workarounds will perhaps always be needed to help overload resolution, and it is good to know the main ones

- type annotations of arguments of overloaded functions: <t : T>
- step-wise applications with local definitions, as in John's example TestRes.f5:  let r : R = f a b in r ** {s = c}   instead of   f a b ** {s = c}
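
Both workarounds can be sketched in a small standalone resource module. The names WorkaroundDemo, R and f are hypothetical and only mirror the shape of the TestRes.f5 example mentioned above; this is an illustration under those assumptions, not the actual test file:

```gf
resource WorkaroundDemo = {
  oper
    -- Hypothetical record type and overloaded function.
    R : Type = { s : Str ; t : Str } ;

    f = overload {
      f : Str -> R = \a -> { s = a ; t = a } ;
      f : Str -> Str -> R = \a, b -> { s = a ; t = b } ;
    } ;

    -- Workaround 1: type annotation <t : T> around the overloaded
    -- application, so its value type is known before the extension.
    viaAnnotation : R = <f "a" "b" : R> ** { s = "c" } ;

    -- Workaround 2: step-wise application with a typed local definition,
    -- instead of writing  f "a" "b" ** { s = "c" }  directly.
    viaLet : R = let r : R = f "a" "b" in r ** { s = "c" } ;
}
```

In both cases the record extension is applied to a term whose type is already fixed, so overload resolution does not have to be combined with inferring the type of the extended record.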

I may be too pessimistic here, and someone can find a smarter solution than any of the ones I have been trying. But the reason for my cautiousness is the lack of formalization of the type system.

Maybe there are GF developers out there who get interested in this problem and can help us all out! One possible form of help would be counter-examples to a complete and sound type inference system combining all the features mentioned above. This should *not* imply that we restrict the GF compiler and reject some legacy code, but it would help clarify when we need workarounds such as type annotations.

  Aarne.














