Generating grammar-conformant texts from ANTLR grammars (or some other way)?


rtm...@googlemail.com

Apr 15, 2022, 6:16:02 AM
to antlr-discussion

ANTLR eats text input per a given grammar; does anybody know a way of generating, randomly or systematically, texts (in parsing terminology, sentences) from grammars?

Use case: I need to verify my thingy parses (and soon, evaluates) expressions identically to the way MSSQL does, so I need to produce a large number of expressions, then parse and evaluate them both with my parser/evaluator and with MSSQL's and verify they give the same result. Systematic rather than random would be best.

In theory MSSQL's BNF is reliable. In practice I don't trust it, and I know for a fact there is a weird corner-case flaw in it (and even if it were perfect, I will make mistakes).

Any ideas would be welcome.

thanks

jan

Mike Lischke

Apr 15, 2022, 8:03:53 AM
to ANTLR discussion group
Hi Jan,

ANTLR eats text input per a given grammar; does anybody know a way of generating, randomly or systematically, texts (in parsing terminology, sentences) from grammars?

In my VS Code extension (https://marketplace.visualstudio.com/items?itemName=mike-lischke.vscode-antlr4) I have a feature to generate sentences, though it's not enabled yet, because it's a tricky thing. Non-trivial grammars have a large number of alternatives, which multiply quickly into an overall number of possible sentences that is not manageable. So the usual approach is to randomly select a rule alternative, guided by a weight value to ensure all alternatives are eventually selected, provided enough sentences are generated.

I plan to enable that feature in the next release, which, however, still needs a bit of work in a different area.


Use case: I need to verify my thingy parses (and soon, evaluates) expressions identically to the way MSSQL does, so I need to produce a large number of expressions, then parse and evaluate them both with my parser/evaluator and with MSSQL's and verify they give the same result. Systematic rather than random would be best.

Note: testing a parser using sentences created from its own grammar is not going to help you much (except for things like predicates, which change the parsing process independently of the grammar), because by definition such a parser and the generated sentences match; they are generated from the same source. Better would be to use another source (like the BNF grammar) and generate sentences from that, which you can feed to your (ANTLR4-based) parser.


In theory MSSQL's BNF is reliable. In practice I don't trust it, and I know for a fact there is a weird corner-case flaw in it (and even if it were perfect, I will make mistakes).

You will never be able to test all possible cases, so ...

rtm...@googlemail.com

Apr 15, 2022, 9:06:33 AM
to antlr-discussion
Hi Mike,
see below.

On Friday, April 15, 2022 at 1:03:53 PM UTC+1 mike.l...@googlemail.com wrote:
Hi Jan,

ANTLR eats text input per a given grammar; does anybody know a way of generating, randomly or systematically, texts (in parsing terminology, sentences) from grammars?

In my VS Code extension (https://marketplace.visualstudio.com/items?itemName=mike-lischke.vscode-antlr4) I have a feature to generate sentences, though it's not enabled yet, because it's a tricky thing. Non-trivial grammars have a large number of alternatives, which multiply quickly into an overall number of possible sentences that is not manageable. So the usual approach is to randomly select a rule alternative, guided by a weight value to ensure all alternatives are eventually selected, provided enough sentences are generated.

I'd thought that a systematic approach would do a breadth-first traversal through the grammar tree. Of course I haven't actually done it, or even thought much beyond what I've just written here.
 

I plan to enable that feature in the next release, which, however, still needs a bit of work in a different area.


Use case: I need to verify my thingy parses (and soon, evaluates) expressions identically to the way MSSQL does, so I need to produce a large number of expressions, then parse and evaluate them both with my parser/evaluator and with MSSQL's and verify they give the same result. Systematic rather than random would be best.

Note: testing a parser using sentences created from its own grammar is not going to help you much (except for things like predicates, which change the parsing process independently of the grammar), because by definition such a parser and the generated sentences match; they are generated from the same source. Better would be to use another source (like the BNF grammar) and generate sentences from that, which you can feed to your (ANTLR4-based) parser.

The idea is to use MSSQL as an 'oracle' for correctness. I know my grammar currently doesn't reflect the actual precedences of exprs parsed by MSSQL. If I generate sentences from my grammar, evaluate them, then pass the same expressions to the MSSQL server as strings and have it evaluate them, I should get different results (I'll be worried if I don't). So I fix the grammar according to the BNF. Now they should be the same. So I re-run the generation + evaluation and can confirm they always agree.
 


In theory MSSQL's BNF is reliable. In practice I don't trust it, and I know for a fact there is a weird corner-case flaw in it (and even if it were perfect, I will make mistakes).

You will never be able to test all possible cases, so ...

I don't have to create every possible expression; I'll be happy with every possible combination of operators up to a given length, say 6 or 8. That will winkle out any discrepancy in precedence.

At the moment I guess I can do that with nested loops, but later I think I'll be needing something more extensive.
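Something like this sketch, say (the operator set and operand pool here are just placeholders, not MSSQL's full list):

from itertools import product

# Enumerate every expression with exactly n binary operators, to be fed both
# to my evaluator and to MSSQL ("SELECT <expr>"), then diff the results.
OPERATORS = ["+", "-", "*", "/", "%"]   # placeholder operator set
OPERANDS = ["8", "-4", "2"]             # placeholder operand pool

def expressions(n):
    for ops in product(OPERATORS, repeat=n):
        for vals in product(OPERANDS, repeat=n + 1):
            parts = [vals[0]]
            for op, val in zip(ops, vals[1:]):
                parts += [op, val]
            yield "".join(parts)

for expr in expressions(2):
    print(expr)  # e.g. 8+-4*2; evaluate with both engines and compare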

cheers

jan

Mike Lischke

Apr 16, 2022, 5:09:21 AM
to 'rtm...@googlemail.com' via antlr-discussion
In my VS Code extension (https://marketplace.visualstudio.com/items?itemName=mike-lischke.vscode-antlr4) I have a feature to generate sentences, though it's not enabled yet, because it's a tricky thing. Non-trivial grammars have a large number of alternatives, which multiply quickly into an overall number of possible sentences that is not manageable. So the usual approach is to randomly select a rule alternative, guided by a weight value to ensure all alternatives are eventually selected, provided enough sentences are generated.

I'd thought that a systematic approach would do a breadth-first traversal through the grammar tree. Of course I haven't actually done it, or even thought much beyond what I've just written here.

How can a breadth-first approach work here? You have to generate complete sentences, so you have to dive down first before you can select another alternative for a second sentence. Of course you could stop at optional trailing parts, but that's just a special case. Technically, there's no problem with going through every alternative in a rule, but that would require walking all paths in the ATN, which results in an unmanageable number of possible sentences. Instead you pick one path randomly (guided by weights to ensure that already-taken paths become less likely) to create a kind of snapshot of the entire sentence space (which is usually unlimited, due to recursion).
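In sketch form, the weighting idea looks something like this (the halving penalty is an assumption for illustration; the actual implementation computes its weights differently):

import random

# Alternatives start with equal weight; each taken alternative is penalized,
# so untaken alternatives become relatively more likely on later invocations.
class WeightedDecision:
    def __init__(self, num_alts):
        self.weights = [1.0] * num_alts

    def choose(self):
        r = random.uniform(0, sum(self.weights))
        for i, w in enumerate(self.weights):
            r -= w
            if r <= 0:
                self.weights[i] *= 0.5  # make this path less likely next time
                return i
        return len(self.weights) - 1

decision = WeightedDecision(3)          # e.g. a rule with three alternatives
print([decision.choose() for _ in range(12)])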


rtm...@googlemail.com

Apr 16, 2022, 7:04:30 AM
to antlr-discussion
Hi Mike,

On Saturday, April 16, 2022 at 10:09:21 AM UTC+1 mike.l...@googlemail.com wrote:
In my VS Code extension (https://marketplace.visualstudio.com/items?itemName=mike-lischke.vscode-antlr4) I have a feature to generate sentences, though it's not enabled yet, because it's a tricky thing. Non-trivial grammars have a large number of alternatives, which multiply quickly into an overall number of possible sentences that is not manageable. So the usual approach is to randomly select a rule alternative, guided by a weight value to ensure all alternatives are eventually selected, provided enough sentences are generated.

I'd thought that a systematic approach would do a breadth-first traversal through the grammar tree. Of course I haven't actually done it, or even thought much beyond what I've just written here.

How can a breadth-first approach work here?

In terms of quantity, neither DFT nor BFT will help; both will overwhelm you for any non-trivial grammar.

For my particular case I'm looking for a small number of expressions which cover an entire space, to check correctness. Real example: what does this evaluate to in MSSQL?

select 8/4/2

is 1, as expected

how about

select 8/-4/2

gives -4. MSSQL's unary negation has a lower precedence than division, so this is treated as 8/-(4/2).
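To see the two groupings side by side (Python's // floors rather than truncates like MSSQL, but for these values the results coincide):

print((8 // -4) // 2)   # -1: unary minus binds tighter than /, the usual grouping
print(8 // -(4 // 2))   # -4: unary minus binds looser than /, MSSQL's grouping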
 
You have to generate complete sentences, so you have to dive down first before you can select another alternative for a second sentence.

Below is an example of a BFT traversal of a grammar, which basically replaces the stack used in a DFT with a queue, scanning the tree in slices rather than in depth, but I'm acutely aware that I'm probably talking to somebody who knows far more about this than me. I guess I'll post it for the interest of others.
 
Of course you could stop at optional trailing parts, but that's just a special case. Technically, there's no problem with going through every alternative in a rule,

Using a DFT? I'm curious, how do you do that other than using a beam search or similar?
 
but that would require walking all paths in the ATN, which results in an unmanageable number of possible sentences. Instead you pick one path randomly (guided by weights to ensure that already-taken paths become less likely) to create a kind of snapshot of the entire sentence space (which is usually unlimited, due to recursion).

BFT/queue-based traversal example of a tree. Again, sorry if this is ancient news to you.

Grammar is

A -> aB
B -> a | b | A
(start symbol is A)

Have a queue where stuff is pushed from the right and popped on the left

 stuff popped off  <- [queue]  <- stuff pushed

push start symbol A onto the queue

 [A]

Pop the first item in the queue (A) and push its production (aB):

 [aB]

pop the first (and currently only) item and replace it with all productions:

 [aa, ab, aA]

pop the first (aa); it's a finished sentence, output it. Pop the second, ditto; you're left with

 [aA]

pop and replace with A's production:

 [aaB]

pop, replace with B's productions:

 [aaa, aab, aaA]

rinse and repeat.

Obviously, where a stack's depth may grow infinitely for most recursive grammars (though not this one), the queue's width will grow infinitely.
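In runnable form (Python), the same queue-based idea on the toy grammar above:

from collections import deque

# Toy grammar: A -> aB ; B -> a | b | A (start symbol A)
GRAMMAR = {
    "A": [["a", "B"]],
    "B": [["a"], ["b"], ["A"]],
}

def generate_bft(grammar, start, limit=10):
    queue = deque([[start]])   # queue of sentential forms
    results = []
    while queue and len(results) < limit:
        form = queue.popleft()
        # Find the leftmost nonterminal, if any.
        nt = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if nt is None:
            results.append("".join(form))  # terminals only: a finished sentence
            continue
        # Replace it with each of its productions and re-queue.
        for prod in grammar[form[nt]]:
            queue.append(form[:nt] + prod + form[nt + 1:])
    return results

print(generate_bft(GRAMMAR, "A"))  # ['aa', 'ab', 'aaa', 'aab', 'aaaa', ...]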

jan
 

rtm...@googlemail.com

Apr 16, 2022, 7:07:49 AM
to antlr-discussion
I suppose you could, using a DFT, simply bound the depth and output only completely finished (= comprised of terminals only) sentences. I guess that would cover a complete subset of the possible sentence space. That would be simpler.
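Something like this sketch (same toy grammar as in my previous mail; bounding by expansion steps rather than tree depth is an arbitrary choice):

# Toy grammar again: A -> aB ; B -> a | b | A
GRAMMAR = {"A": [["a", "B"]], "B": [["a"], ["b"], ["A"]]}

def bounded_dft(form, depth, out):
    # Expand the leftmost nonterminal; keep only fully terminal results.
    nt = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
    if nt is None:
        out.append("".join(form))   # terminals only: a finished sentence
        return
    if depth == 0:
        return                      # bound hit with nonterminals left: discard
    for prod in GRAMMAR[form[nt]]:
        bounded_dft(form[:nt] + prod + form[nt + 1:], depth - 1, out)

sentences = []
bounded_dft(["A"], 4, sentences)
print(sentences)  # ['aa', 'ab', 'aaa', 'aab']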

jan


Mike Lischke

Apr 16, 2022, 10:43:31 AM
to 'rtm...@googlemail.com' via antlr-discussion
Hey Jan,

For my particular case I'm looking for a small number of expressions which cover an entire space, to check correctness. Real example: what does this evaluate to in MSSQL?

select 8/4/2

is 1, as expected

how about

select 8/-4/2

gives -4. MSSQL's unary negation has a lower precedence than division, so this is treated as 8/-(4/2).

ANTLR4 is particularly helpful for such cases. Just like for parsing, you can also start sentence generation at any rule, and you will only get results for that single rule (and all those used by it). That's perfect for limiting the sentence space. Check this image:


This expr rule is super simple and generates (with a specific configuration) a number of pretty simple sentences. A special char is used to denote that the maximum recursion level was reached (in this example it was set to 3). I need a better way of doing that, as the returned sentences must be parsable by the rule they were created for.


Below is an example of a BFT traversal of a grammar, which basically replaces the stack used in a DFT with a queue, scanning the tree in slices rather than in depth, but I'm acutely aware that I'm probably talking to somebody who knows far more about this than me. I guess I'll post it for the interest of others.

Using a BFT requires holding pending productions/results in memory until everything is finished. Not a wise approach. Instead, look at how it is done in my sentence generator using a DFT: https://github.com/mike-lischke/vscode-antlr4/blob/master/src/backend/SentenceGenerator.ts#L243

The code walks through an ATN sequence (which corresponds to a single alt in a rule) and collects values, joining them together into a single result for that sequence. It also handles iterations and recursions (both are configurable). Additionally, semantic predicates are processed, but that's unrelated here.

The entire algorithm produces only a single sentence and does not need to cache anything except the result. At any point in the traversal only a single alt is taken, so there's no need to keep a tree of productions for backtracking.

You can specify how many sentences are to be created; the computed weights for each decision state are kept and used across all traversal invocations, which is what varies the generated sentences.

 
Of course you could stop at optional trailing parts, but that's just a special case. Technically, there's no problem with going through every alternative in a rule,

Using a DFT? I'm curious, how do you do that other than using a beam search or similar?

Beam search is a BFT, with the beam width determining how many paths you want to take. If you like, you could call my approach a beam search with a beam width of 1, but that doesn't make much sense, because even if I allowed more than one alternative in a decision state (which I did in the beginning), I would still finish one alt before processing the next (depth-first).

 
Obviously, where a stack's depth may grow infinitely for most recursive grammars (though not this one), the queue's width will grow infinitely.

Resource consumption (including processing time) is a serious problem for more complex grammars, so I wanted to avoid caching anything. Additionally, my implementation allows you to specify strings for any parser or lexer rule, which is useful, for example, for identifiers, or even for entire subrules that should be represented by a single value instead of being expanded in full.


Here (in my MySQL grammar) I used constant strings for numbers and quoted strings, to make the result more readable.

Another aspect is random Unicode character selection, which must be tailored in a way that avoids the overwhelming size of the Asian character blocks. But that's going into deeper detail now...

Mike Lischke

Apr 16, 2022, 10:47:10 AM
to 'rtm...@googlemail.com' via antlr-discussion

I suppose you could, using a DFT, simply bound the depth and output only completely finished (= comprised of terminals only) sentences. I guess that would cover a complete subset of the possible sentence space. That would be simpler.

That's the whole idea, yes. Intermediate results are only kept until the final result for one sentence is generated (a result which consists solely of terminals, or of fixed values acting as terminals via the mapping of rule names to simple strings mentioned in my previous mail).


Mike Lischke

Apr 16, 2022, 10:57:02 AM
to 'rtm...@googlemail.com' via antlr-discussion
In terms of quantity, neither DFT nor BFT will help; both will overwhelm you for any non-trivial grammar.

It would be interesting to use machine learning for that (GPT-3?). But that's a completely different thing and requires something clever to train the neural network needed for it. Ter published a project a few years ago that formats source code using machine learning, based on files that already have the correct format (https://github.com/antlr/codebuff). Something like that could be a starting point for sentence generation.

Terence Parr

Apr 16, 2022, 12:27:15 PM
to antlr-di...@googlegroups.com
Interesting discussion. I tried and gave up generating sentences from the grammar years ago but haven't looked at it since.
Ter
--
Dictation in use. Please excuse homophones, malapropisms, and nonsense.



Terence Parr

Apr 16, 2022, 12:31:05 PM
to antlr-di...@googlegroups.com
Machine learning, deep learning networks in particular, have gotten really good at generating human text, but I think a lot of what comes out of the networks for programming code has to be thrown out because it's not valid. On the other hand, you just have to check the output and filter for the good stuff.

In the case of generating code from a grammar, we need it to be specifically valid for the grammar, so maybe a hybrid approach would work. Essentially we would need to create a language model that understood the probability of seeing a particular token at a particular point in an input stream, then verify it's valid according to the grammar and emit it. Something like that should work.

Ter

Dictation in use. Please excuse homophones, malapropisms, and nonsense.



rtm...@googlemail.com

Apr 16, 2022, 2:18:25 PM
to antlr-discussion
Hi Mike,
thanks for this, I'll get back to you properly in a couple of days. This looks very interesting.

Just to clear something up: I led you up the garden path talking about beam search. I've no idea where that came from; I actually meant iterative deepening <https://en.wikipedia.org/wiki/Iterative_deepening_depth-first_search>, sorry about that!

jan

rtm...@googlemail.com

Apr 16, 2022, 2:21:20 PM
to antlr-discussion
Hi Ter

On Saturday, April 16, 2022 at 5:31:05 PM UTC+1 Terence Parr wrote:
Machine learning, deep learning networks in particular, have gotten really good at generating human text, but I think a lot of what comes out of the networks for programming code has to be thrown out because it's not valid. On the other hand, you just have to check the output and filter for the good stuff.

In the case of generating code from a grammar, we need it to be specifically valid for the grammar, so maybe a hybrid approach would work. Essentially we would need to create a language model that understood the probability of seeing a particular token at a particular point in an input stream, then verify it's valid according to the grammar and emit it. Something like that should work.

Not that I've ever used them, but that does sound like Markov chains. Are those remotely relevant?

jan
 

rtm...@googlemail.com

Apr 16, 2022, 3:00:41 PM
to antlr-discussion
That was a dumb suggestion; skip it.

jan

Mike Lischke

Apr 17, 2022, 3:20:13 AM
to antlr-di...@googlegroups.com
Machine learning, deep learning networks in particular, have gotten really good at generating human text, but I think a lot of what comes out of the networks for programming code has to be thrown out because it's not valid. On the other hand, you just have to check the output and filter for the good stuff.

In the case of generating code from a grammar, we need it to be specifically valid for the grammar, so maybe a hybrid approach would work. Essentially we would need to create a language model that understood the probability of seeing a particular token at a particular point in an input stream, then verify it's valid according to the grammar and emit it. Something like that should work.

I don't know much about machine learning, but I hoped it would be possible to define/learn configurations of the NN based on the grammar. After all, a grammar is a very strict description of a language, unlike natural language. On the other hand, the effort to train such a network might be way too much for the outcome.


Terence Parr

Apr 17, 2022, 4:32:03 PM
to antlr-di...@googlegroups.com
Hi Jan, yep, stochastic networks like Markov chains are basically how you would choose new sentences based on the grammar. Somehow we would need to come up with the probability of seeing token T given the previous history in the network/grammar/ATN.
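A minimal bigram sketch of that idea, trained on a toy corpus (a real hybrid would also check each sampled token against the grammar/ATN before emitting it):

import random
from collections import Counter, defaultdict

# Estimate P(token | previous token) from valid sentences, then sample
# token by token until the end marker is reached.
def train_bigram(corpus):
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, tok in zip(tokens, tokens[1:]):
            counts[prev][tok] += 1
    return counts

def sample(model, max_len=20):
    out, prev = [], "<s>"
    while prev != "</s>" and len(out) < max_len:
        tokens, weights = zip(*model[prev].items())
        prev = random.choices(tokens, weights=weights)[0]
        if prev != "</s>":
            out.append(prev)
    return " ".join(out)

model = train_bigram(["select 8 / 4 / 2", "select 8 / - 4 / 2"])
print(sample(model))  # e.g. select 8 / - 4 / 2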
Ter

Dictation in use. Please excuse homophones, malapropisms, and nonsense.


Ken Domino

Apr 17, 2022, 8:49:45 PM
to antlr-discussion
For what it's worth, one should read about Grammarinator (https://github.com/renatahodovan/grammarinator, https://dl.acm.org/doi/abs/10.1145/3278186.3278193). It is a practical Python application that takes an ANTLR4 grammar, a max tree depth, and the number of strings to generate, and outputs the generated strings in separate files. It is a worthwhile read, as the author does several clever things in the implementation. But there are several issues I have with Grammarinator:

(1) While all strings are syntactically valid, the strings do not look anywhere near what a normal human being would write.
(2) It doesn't have an easy way to "plug in" a "model" used to guide the leftmost depth-first derivation. In my opinion, the model should somehow be derived from a corpus, which I am investigating.
(3) There's a wide range of string sizes generated, half of which were empty for one grammar that I tried (see https://github.com/renatahodovan/grammarinator/issues/43).
(4) Using the maximum depth of a tree is a very indirect way of sizing the generated strings. Purdom, the grand-daddy of sentence generators, had that right: https://link.springer.com/article/10.1007/BF01932308

--Ken

rtm...@googlemail.com

Apr 18, 2022, 1:07:11 AM
to antlr-discussion
Yup, but choosing a symbol at weighted random is the easy part, and it doesn't touch on the validity of the sentence produced, which is why I withdrew that.

I suppose the question then becomes: what is a 'valid' sentence? In a more general case than mine (mine being just generating all possible expressions of no more than n operators and feeding them ints and floats), that question has to be answered up front.

FYI, for typesafe output there exists something I've vaguely heard of: Van Wijngaarden grammars <https://en.wikipedia.org/wiki/Van_Wijngaarden_grammar>, meta-grammars that produce grammars that produce sentences.

"For example, the assignment x:=1 is only valid if the variable x can contain an integer. Therefore, the context-free syntax 'variable := value' is incomplete. In a two-level grammar, this might be specified in a context-sensitive manner as 'REF TYPE variable := TYPE value'. Then 'ref integer variable := integer value' could be a production rule but 'ref Boolean variable := integer value' is not a possible production rule. This also means that assigning with incompatible types becomes a syntax error which can be caught at compile-time."

FYI anyway. I don't think ANTLR supports these yet :)

jan

Terence Parr

Apr 18, 2022, 12:57:16 PM
to antlr-discussion
Excellent points, Jan! Yeah, context-free grammars really never handle the semantic issues of variable definitions, at least not without predicates, and that is very informal. There are definitely grammar systems that try to do this.
Ter




Terence Parr

Apr 18, 2022, 1:01:52 PM
to antlr-discussion
Thanks for the pointer, Ken! That does seem like a really cool tool and a good starting point. Just tweeted it.
Ter


