Hi,
I am currently experimenting with GF-WordNet to set up an example which
- parses a German sentence
- removes ambiguity by selecting the "correct" abstract syntax/meaning
- translates it into other languages.
e.g. "Die Frau singt" -> an abstract syntax tree such as PredVPS (DetCN (DetQuant DefArt NumSg) (UseN woman_1_N)) (MkVPS (TTAnt TPres ASimul) PPos (UseV sing_2_V)) — with woman_1_N and sing_2_V picked as the intended WordNet senses — which then linearizes to English "the woman sings", Spanish "la mujer canta", Italian "la donna canta".
For this I would love to use the C runtime:
the Haskell runtime enumerates all trees, which even for simple sentences explodes to more than 1,000 leaves.
The C runtime promises bounded n-best parsing, which could help.
Also, I wanted to see how I can boil down the tree given that I know the "correct" abstract meaning of the nouns/verbs/adjectives involved.
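To make the pruning idea concrete, here is a rough sketch of what I have in mind. It is plain Python and does not touch the GF runtime at all: it assumes the parser hands back (score, tree) pairs lazily, with the trees printed as strings. The helper names `functions_in` and `keep_senses`, the scores, and the competing senses `woman_2_N` and `sing_1_V` are made up for illustration.

```python
import re
from itertools import islice


def functions_in(tree):
    """Return the set of identifiers occurring in a printed abstract syntax tree."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_']*", tree))


def keep_senses(candidates, wanted, limit=10):
    """Look at no more than `limit` (score, tree) pairs and keep the trees
    that contain every identifier in `wanted`."""
    wanted = set(wanted)
    return [
        (score, tree)
        for score, tree in islice(candidates, limit)
        if wanted <= functions_in(tree)
    ]


# Hypothetical parser output: scores and the alternative senses are invented.
CANDIDATES = [
    (1.2, "PredVPS (DetCN (DetQuant DefArt NumSg) (UseN woman_2_N)) "
          "(MkVPS (TTAnt TPres ASimul) PPos (UseV sing_2_V))"),
    (1.5, "PredVPS (DetCN (DetQuant DefArt NumSg) (UseN woman_1_N)) "
          "(MkVPS (TTAnt TPres ASimul) PPos (UseV sing_2_V))"),
    (2.0, "PredVPS (DetCN (DetQuant DefArt NumSg) (UseN woman_1_N)) "
          "(MkVPS (TTAnt TPres ASimul) PPos (UseV sing_1_V))"),
]

if __name__ == "__main__":
    for score, tree in keep_senses(iter(CANDIDATES), {"woman_1_N", "sing_2_V"}):
        print(score, tree)
```

Because `islice` stops after `limit` items, this only helps if the parser really yields candidates lazily and best-first, which is exactly what I hope bounded n-best parsing in the C runtime would give me.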
So I went looking and found the `majestic` branch via issue #130 ("PGF as a database"); it looked very promising. I built the whole stack on WSL/Debian:
- C runtime (`src/runtime/c`, libpgf),
- Python binding (`src/runtime/python`),
- gf compiler (gf-4.0.0),
and downloaded the robust German grammar as an NGF.
`bootNGF`, `readNGF` and `lookupMorpho` work great and are *instant*
(mmap) — really nice.
The problem: parsing crashes. I un-commented `Concr.parse` (the binding's wrapper around `pgf_parse`) to see whether I could get it to run. It segfaults *inside the runtime*, even on the tiny `Food`
example compiled by the majestic `gf` itself (so it's not a format mismatch). The crashes seem to be in the LR machinery: `PgfParser::shift` dereferences an invalid `shift->seq`, and there is a
division by zero in `Production::operator new` where `lin->res.size()==0`.
`git log` on `parser.cxx` shows active work ("an experimental left-corner table maker", etc.), so it looks like the parser on `majestic` is still mid-rewrite / incomplete.
Could somebody give me some guidance?
1. How does `majestic` relate to the other C-runtime branches? I see several (`pgf2-complete`, `lpgf`/`lpgf-memo`/`lpgf-string`, `concrete-new`, `compact-pgf`, `c-runtime`, ...).
2. Is there a branch/commit where end-to-end *parsing* with the NGF format actually works, or is it planned to get the parser running on the `majestic` branch at some point?
3. Any guidance on getting the C runtime + NGF running for parsing (not just lookup/linearization) would be hugely helpful.
Thanks a lot for any pointers!
Martin