ANTLR eats text input per a given grammar, does anybody know a why of generating, randomly or systematically, texts (in parsing terminology, sentences) from grammars?
Use case: I need to verify my thingy parses (and soon, evaluates) expressions identically to the way MSSQL does, so I need to produce a large number expressions, then parse and evaluate both using my parser/evaluator and MSSQL's and verify they give the same result. Systemic rather than random would be best.
In theory MSSQL's BNF is reliable. In practice I don't trust it and I know for a fact there is a weird corner case flaw in it (and even if it was perfect I will make mistakes)
Hi Jan,ANTLR eats text input per a given grammar, does anybody know a why of generating, randomly or systematically, texts (in parsing terminology, sentences) from grammars?In my VS Code extension (https://marketplace.visualstudio.com/items?itemName=mike-lischke.vscode-antlr4) I have a feature to generate sentences, even though it's not enabled yet, because it's a tricky thing. Non trivial grammars have a large number of alternatives, which potentiate each other quickly to an overall amount of possible sentences that is not manageable. So, the usual approach is to randomly select a rule alternative, guided by a weight value to ensure all alternatives are selected, if only enough sentences are generated.
I plan to enable that feature in the next release, which however needs still a bit work in a different area.
Use case: I need to verify my thingy parses (and soon, evaluates) expressions identically to the way MSSQL does, so I need to produce a large number expressions, then parse and evaluate both using my parser/evaluator and MSSQL's and verify they give the same result. Systemic rather than random would be best.Note: testing a parser using sentences created from its grammar is not going to help you much (except for things like predicates, which change the parsing process independent of the grammar), because per definition such a parser and the generated sentences match, because they are generated from the same source. Better would be to use another source (like the BNF grammar) and generate sentences from that, which you can feed to your (ANTLR4 based) parser.
In theory MSSQL's BNF is reliable. In practice I don't trust it and I know for a fact there is a weird corner case flaw in it (and even if it was perfect I will make mistakes)You will never be able to test all possible cases, so ...
In my VS Code extension (https://marketplace.visualstudio.com/items?itemName=mike-lischke.vscode-antlr4) I have a feature to generate sentences, even though it's not enabled yet, because it's a tricky thing. Non trivial grammars have a large number of alternatives, which potentiate each other quickly to an overall amount of possible sentences that is not manageable. So, the usual approach is to randomly select a rule alternative, guided by a weight value to ensure all alternatives are selected, if only enough sentences are generated.
I'd thought that if a systematic way was done, it would do a breadth-first traversal through the grammar tree. Of course I haven't actually done it, or even thought much beyond what I've just written here.
In my VS Code extension (https://marketplace.visualstudio.com/items?itemName=mike-lischke.vscode-antlr4) I have a feature to generate sentences, even though it's not enabled yet, because it's a tricky thing. Non trivial grammars have a large number of alternatives, which potentiate each other quickly to an overall amount of possible sentences that is not manageable. So, the usual approach is to randomly select a rule alternative, guided by a weight value to ensure all alternatives are selected, if only enough sentences are generated.
I'd thought that if a systematic way was done, it would do a breadth-first traversal through the grammar tree. Of course I haven't actually done it, or even thought much beyond what I've just written here.How can a breadth-first approach work here?
You have to generate complete sentences, so you have to dive down first, before you can select another alternative for a second sentence.
Of course you could stop at optional trailing parts, but that's just a special case. Technically, there's no problem to go through every alternative in a rule,
but that will require to go through all paths in the ATN, which results in an unmanageable amount of possible sentences. Instead you pick one path randomly (guided by weights to ensure that already taken paths become less likely) to create kinda snapshot of the entire sentence space (which is usually unlimited, due to recursion).
For my particular case I'm looking for a small number of expressions which cover an entire space, to check correctness. Real example, what does this evaluate to in MSSQL?:
select 8/4/2
is 1, as expected
how about
select 8/-4/2
gives -4. MSSQL's unary negation has a lower precedence than division so this is treated as 8/-(4/2)

Below is an example of a BFT doing a traversals of a grammar, which basically replaces the stack used in a DFT with the queue, which scans the tree in slices rather than in-depth (BFT), but I'm acutely aware that I'm probably talking to somebody who knows a far more about this than me. I guess I'll post it for the interest of others.
Of course you could stop at optional trailing parts, but that's just a special case. Technically, there's no problem to go through every alternative in a rule,
Using a DFT? I'm curious, how do you do that other than using a beam search or similar?
Obviously where a stack's depth may grow infinitely for most recursive grammars (thought not this one), the queue's width will grow infinitely.

I suppose you could, using a DFT, simply bound the depth and output only completely finished (= being comprised of terminals only) sentences. I guess that would cover a complete subset of the possible sentence space. That would be simpler.
In terms of quantity, DFT or BFT won't help, both will overwhelm you for any non-trivial grammar.
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/5ac4c12b-bec3-45df-bb17-1792d337c8a7n%40googlegroups.com.
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/4625709A-3B0A-4DC9-BE99-C8E6A6837A77%40googlemail.com.
Machine learning, deep learning networks in particular, I've gotten really good at generating human text but I think a lot of what comes out of the networks for programming code has to be thrown out because it's not valid. On the other hand you just have to check the output and filter for the good stuff.In the case of generating code from grammar, we needed to be specifically valid for the grammar so maybe a hybrid approach would work. Essentially we would need to create a language model that understood the probability of seeing a particular token at a particular point in an input stream then verify it's valid according to the grammar emit it. Something like that should work
Machine learning, deep learning networks in particular, I've gotten really good at generating human text but I think a lot of what comes out of the networks for programming code has to be thrown out because it's not valid. On the other hand you just have to check the output and filter for the good stuff.In the case of generating code from grammar, we needed to be specifically valid for the grammar so maybe a hybrid approach would work. Essentially we would need to create a language model that understood the probability of seeing a particular token at a particular point in an input stream then verify it's valid according to the grammar emit it. Something like that should work
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/f12f3939-0675-4336-b23e-c84bbd69c7b8n%40googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/c2cf5bc6-81a9-4355-979e-14aefcf4ebe6n%40googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/f87b251b-12c3-4770-9c84-4b51a6641e16n%40googlegroups.com.