TAB breaks parsing/linearization identity

bruno cuconato

May 4, 2018, 4:21:42 PM
to Grammatical Framework
hello,

it seems that including a TAB character in a grammar will break the identity between parsing and linearization. 

consider the following example grammar, available as a gist:

abstract Test = {
 cat S ; -- entry
     Thing ;
 fun
   A, B : Thing ;
   mkTest : Thing -> Thing -> S ;

} ;


concrete TestConc of Test = open Prelude in {
 lincat
   S, Thing = SS ;
 lin
   A = ss "A" ;
   B = ss "B" ;
   mkTest a b = ss (a.s ++ "\t" ++ b.s) ;
} ;

this grammar gives the following errors when generating trees, linearizing them, and then parsing the result:
> i TestConc.gf
- compiling TestConc.gf...   write file TestConc.gfo
linking
... OK

Languages: TestConc
12 msec
Test> gr -number=3 | l -lang=Conc | p -lang=Conc
The parser failed at token 2: "A"
The parser failed at token 2: "B"
The parser failed at token 2: "B"
4 msec

if we change the TAB to another character (e.g., a pipe: "|"), everything works as expected.

-- bruno cuconato

Aarne Ranta

May 5, 2018, 3:37:41 AM
to gf-...@googlegroups.com
Hello Bruno,

A very good observation!

The explanation is that GF parsing and linearization ignore all whitespace, except as a separator of tokens.  Hence " " == "  " == "\t" == "\n" == "\t\n  \n\n\t" etc. It follows that
- In the grammar compiler, an expression of type Str is normalized to a list of strings without whitespace.
- Linearization, therefore, can never produce specific whitespace characters, but just (by default) single ' ' characters to separate tokens.
- The parser will read as its input a sequence of tokens. 
  - in the GF shell, the "p" command expects the tokens to be separated by whitespace, more precisely, by any sequence defined by the regexp (' '|'\t'|'\n')+   
  - in the C runtime, special tokens such as BIND are inserted in accordance with the grammar, even if no whitespace is given in the input
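
The separator rule for the shell's "p" command can be illustrated outside GF. This is a minimal Python sketch that only mimics the regexp quoted above; `tokenize` is a hypothetical helper for illustration, not a function of GF or its runtimes.

```python
import re

# Separator pattern quoted in the message above: any run of space, tab, or newline.
SEPARATOR = re.compile(r"[ \t\n]+")

def tokenize(text):
    """Split text into tokens, discarding the whitespace between them."""
    return [tok for tok in SEPARATOR.split(text) if tok]

# Every input below yields the same token list, so which whitespace
# character separated the tokens cannot be recovered afterwards.
inputs = ["A B", "A  B", "A\tB", "A\nB", "A\t\n  \n\n\tB"]
token_lists = [tokenize(s) for s in inputs]
```

Under this rule a tab in the input is only ever a separator, never a token of its own.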

The recommended way to control whitespace characters is to encode them as special tokens and use simple pre- and postprocessing. The ps command of the GF shell offers some support for this. In many applications, the proper way is to use a high-level approach to layout, e.g. HTML tags or LaTeX macros.
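
One way to picture the special-token approach is the Python sketch below. It assumes the grammar is changed to emit a placeholder token instead of "\t"; the name `TABTOKEN` and the helpers `preprocess` and `postprocess` are made up for this illustration and are not part of GF.

```python
# Hypothetical placeholder token. The grammar would use it in place of "\t",
# e.g. mkTest a b = ss (a.s ++ "TABTOKEN" ++ b.s) (an assumed variant of the
# grammar in this thread, not something GF prescribes).
TAB_TOKEN = "TABTOKEN"

def preprocess(text):
    """Before parsing: make each tab a visible token, 'A\\tB' -> 'A TABTOKEN B'."""
    return " ".join(text.replace("\t", f" {TAB_TOKEN} ").split())

def postprocess(linearized):
    """After linearization: turn 'A TABTOKEN B' back into 'A\\tB'."""
    out = ""
    need_space = False  # True when the next ordinary token needs a space before it
    for tok in linearized.split():
        if tok == TAB_TOKEN:
            out += "\t"
            need_space = False
        else:
            if need_space:
                out += " "
            out += tok
            need_space = True
    return out
```

With such a wrapper, `preprocess` would be applied to text before it is handed to the parser and `postprocess` to whatever the linearizer returns. The placeholder has to be a string that cannot clash with any real token of the grammar.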

Regards

  Aarne.





bruno cuconato

May 7, 2018, 6:12:17 PM
to gf-...@googlegroups.com
thank you for the explanation!

indeed, I ended up doing the pre-/post-processing in order to achieve my goals.

I wonder if it'd be possible to configure this behaviour from the PGF API; using GF for things other than natural language -- which I gather is unorthodox -- could certainly profit from it! (and albeit uncommon, we do have the postscript example from the book, for instance!)

-- bruno cuconato