Avoid SKIPping while producing another terminal

Cesare Zecca

Aug 16, 2007, 11:27:50 AM

to cze...@ads.it

Hi all

I took the NL_Xlator grammar (see javacc-4.0\examples\SimpleExamples
\NL_Xlator.jj) as starting point for my grammar (named GpL) for a
relatively simple "arithmetic" language enriched with a limited set of
other features.
Currently the application has some entities, each with a string description
and its own micro-grammar
(identifiers, string and number literals, group names, and so on).
These entities have hand-coded token analyzers, which have
already been unit tested.
E.g.
The Identifiers ("names" in the application jargon) do correspond to
the token

and have their domain class NameDom, whose canHandle() method
checks whether the provided string respects the grammar rule quoted
just above.
The same holds for the other entities (terminals).

I have a first skeleton of GpL.jj, a grammar derived from NL_Xlator.
I wrote the terminals

TOKEN :
{ <ID : ["a"-"z", "A"-"Z","_"] ( ["a"-"z", "A"-"Z", "_", "0"-"9"] )* >
| <NUM: (["0"-"9"])+ >
| <STRING_LITERAL
    : "\"" ( ~["\"","\\","\n","\r"]
           | "\\" ( ["n","t","r","f","\\","\'","\""] )?
           | ( ["\n","\r"] | "\r\n" )
           )* "\""
  >
}

and some non-terminal rules (see javacc-4.0\examples\SimpleExamples
\NL_Xlator.jj)

ExpressionList
-> Expression
-> Term
-> Factor()

To test a single, specific rule I isolated an Id() rule to call it
from the unit test class

String
Id() :
{
  Token lToken;
  String lResult;
}
{
  lToken = <ID>
  {
    lResult = lToken.image;
    return lResult;
  }
} // end GpL.Id()

Well, to test specifically the <ID> / Id() rule for identifiers, I
called the GpL.Id() method from the UT class and it seems to
work.
Identifiers such as

Call: Id
Consumed token: <<ID>: "myName_1_isOk" at line 1 column 1>
Return: Id

Call: Id
Consumed token: <<ID>: "_Foo2_boo2_goo2" at line 1 column 1>
Return: Id

and many others are all correctly "approved".

Instead, the string

"foo2 boo2"

(which is NOT an <ID>) is wrongly approved (no ParseException is
thrown).

GpL.jj does contain the production

SKIP :
{ " "
| "\t"
| "\n"
| "\r"
}

as well, but I would like no skipping to be performed during
the <ID> / Id() rule.

Any suggestions?

Thanks in advance for any reply.

Ciao
Cesare

AC

Aug 18, 2007, 7:42:40 AM

This looks like it might be accepting just a prefix of the input.

Is the parsed Id "foo2" when parsing "foo2 boo2"?

The code of the test case is not shown. Maybe it just calls Id(),
which reads the next <ID> token but does not force that to be the end
of the input. Since there is a valid <ID> in a prefix of the input,
it accepts that much, and would parse the other id if Id() was called
again. A production must end with the <EOF> token to force the
grammar to accept the entire input and not a prefix.

To create unit tests of productions smaller than the top level
production (with <EOF>):

(a) add test productions with <EOF> to the grammar. If Id() is a test
production, then it would contain
<ID> <EOF>

Or

(b) add an Eof() production that contains <EOF> that you can call after
the tested production from a test method outside the grammar as follows:
GpL.Id();
GpL.Eof();
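
The difference between calling Id() alone and calling Id() followed by Eof() can be reproduced without JavaCC. What follows is only a hand-written stand-in, not generated code: the class MiniParser and its methods id() and eof() are hypothetical names that imitate a parser which skips blanks and then reads one <ID>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written stand-in for a generated parser (hypothetical, NOT JavaCC output).
class MiniParser {
    // Same shape as the <ID> token in the grammar quoted earlier in the thread.
    private static final Pattern ID = Pattern.compile("[A-Za-z_][A-Za-z_0-9]*");
    private final String input;
    private int pos = 0;

    MiniParser(String input) {
        this.input = input;
    }

    // Imitates the SKIP production: blanks, tabs and line ends are dropped.
    private void skip() {
        while (pos < input.length() && " \t\n\r".indexOf(input.charAt(pos)) >= 0) {
            pos++;
        }
    }

    // Imitates Id(): consumes one identifier and returns its image.
    String id() {
        skip();
        Matcher m = ID.matcher(input);
        m.region(pos, input.length());
        if (!m.lookingAt()) {
            throw new IllegalStateException("expected <ID> at column " + (pos + 1));
        }
        pos = m.end();
        return m.group();
    }

    // Imitates Eof(): fails unless only skippable characters remain.
    void eof() {
        skip();
        if (pos < input.length()) {
            throw new IllegalStateException("expected <EOF> at column " + (pos + 1));
        }
    }

    public static void main(String[] args) {
        MiniParser p = new MiniParser("foo2 boo2");
        System.out.println(p.id());      // the prefix "foo2" is accepted
        try {
            p.eof();                     // the remaining "boo2" is rejected here
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With id() alone the input "foo2 boo2" passes, because nothing ever looks past the first identifier; only the eof() call turns the leftover "boo2" into an error.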

Hope this helps!

Cesare Zecca

Aug 20, 2007, 8:35:20 AM

On 18 Aug, 13:42, AC <u...@domain.invalid> wrote:

Many thanks, AC!
:)

> This looks like it might be accepting just a prefix of the input.
>
> Is the parsed Id "foo2" when parsing "foo2 boo2"?

Yes, "foo2" was the parsed id.

> The code of the test case is not shown

Well, basically the code is shown below (a first, rough prototype).
I use reflection to get the GpL rule method (rule id) from its name, in
the applyRule() method of the GpLProxy class

public
ParseException
applyRule
( String pMethodName // rule id
, String pInput
)
{
  ParseException lResult = null;

  Reader lReader = toReader( pInput );
  GpL.ReInit( lReader );
  Method lMethod = toMethod( pMethodName );
  try
  {
    lMethod.invoke( (Object) null, (Object[]) null );
    GpL.Eof(); // just added, following your suggestion
  }
  catch ( ... )
  ...

The client code (the unit test) is, and originally was (pre-JavaCC), like
assertNull( lDom.canHandle( "myName_1_isOk" ) );
assertNull( lDom.canHandle( "_Foo2_boo2_goo2" ) );
...
assertNotNull( lDom.canHandle( "foo2 boo2" ) );
assertNotNull( lDom.canHandle( "#foo" ) );
assertNotNull( lDom.canHandle( "@mail.com" ) );
...

NameDom.canHandle() has been modified to use GpLProxy and now looks
more or less like

...
String lRuleId = "Id";
ParseException lParseException =
    GpLProxy.instance().applyRule( lRuleId, pString );
if ( lParseException != null )
  lResult = new DomainException( NameDom.class,
      lParseException.getMessage() );
else
  lResult = null;
...

> [...] A production must end with the <EOF> token to force the
> grammar to accept the entire input and not a prefix.
>
> To create unit tests of productions smaller than the top level
> production (with <EOF>):
>
> (a) add test productions with <EOF> to the grammar. If Id() is a test
> production, then it would contain
> <ID> <EOF>

Thanks.
All this raises the issue of how to handle a
"production" and a "development (& test)" version in a single file. But
this is another (engineering) matter. For the time being I have opted for
the other suggested solution, below (it confines references to EOF
to the UT code).

> Or
>
> (b) add an Eof() production that contains <EOF> that you can call after
> the tested production from a test method outside the grammar as follows:
> GpL.Id();
> GpL.Eof();
>
> Hope this helps!

It greatly helped! I've added the GpL.Eof() call (see above).
With the "foo2 boo2" input now the diagnostic is

...gpj.ec.istruzioni.gpl.ParseException: Encountered "boo2" at line 1,
column 6.
Was expecting:

There are other issues here.
I)
I have to change GplProxy.applyRule() because not only ParseExceptions
can be thrown but TokenMgrErrors as well
(as I have just realized while going a few steps further with the unit
testing; see the "#foo" test case).
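
Since applyRule() invokes the rule through reflection, whatever the rule throws reaches the caller wrapped in an InvocationTargetException, and getCause() gives back the original exception or error. A minimal sketch under stated assumptions: FakeParseException, FakeTokenMgrError, FakeGpL and RuleInvoker are hypothetical stand-ins for the generated classes and for GplProxy, kept tiny so that the example runs on its own.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Stand-ins for the generated classes (hypothetical; the real ones come from JavaCC).
class FakeParseException extends Exception {
    FakeParseException(String message) { super(message); }
}

class FakeTokenMgrError extends Error {
    FakeTokenMgrError(String message) { super(message); }
}

// Stand-in for a static generated parser with one method per rule.
class FakeGpL {
    static void Good() { }

    static void BadSyntax() throws FakeParseException {
        throw new FakeParseException("Encountered \"boo2\" at line 1, column 6.");
    }

    static void BadLexeme() {
        throw new FakeTokenMgrError("Lexical error at line 1, column 1.");
    }
}

class RuleInvoker {
    // Returns null when the rule succeeds, otherwise the Throwable raised by the rule.
    static Throwable applyRule(String methodName) {
        try {
            Method rule = FakeGpL.class.getDeclaredMethod(methodName);
            rule.invoke(null);         // static method: no receiver, no arguments
            return null;
        } catch (InvocationTargetException e) {
            return e.getCause();       // the exception or error thrown inside the rule
        } catch (ReflectiveOperationException e) {
            return e;                  // no such rule, or the rule is not accessible
        }
    }

    public static void main(String[] args) {
        System.out.println(applyRule("Good"));
        System.out.println(applyRule("BadSyntax"));
        System.out.println(applyRule("BadLexeme"));
    }
}
```

Returning Throwable rather than ParseException lets one return type cover both the syntactic and the lexical failure.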

II)
The diagnostic seems incomplete at first glance (what is
the parser expecting? I don't see anything...),
even though ParseException.getMessage() goes through

if (expectedTokenSequences.length == 1) {
retval += "Was expecting:" + eol + " ";

(the final spaces do not "appear").

III)
Rather, I would have preferred a diagnostic like

[%] Lexical error at line 1, column 6. Encountered: " " (35),
after : "...oo2"

or something similar. The "#foo" test case exercises the system
at the lexical level and yields a better
diagnostic.
If I have guessed correctly, I have to find out how to make the token manager
(instead of the parser) do the work, to get a more specific, lexical
diagnostic (see [%] above); i.e., programmatically
speaking, to get a TokenMgrError instead of a ParseException. But
for me, as a beginner, this is still an(other) open issue.

Thanks for the help.

ciao
Cesare

Cesare Zecca

Aug 21, 2007, 1:25:51 PM

On 18 Aug, 13:42, AC <u...@domain.invalid> wrote:

> The code of the test case is not shown. Maybe it just calls Id(),
> which reads the next <ID> token but does not force that to be the end
> of the input. Since there is a valid <ID> in a prefix of the input,
> it accepts that much, and would parse the other id if Id() was called
> again. A production must end with the <EOF> token to force the
> grammar to accept the entire input and not a prefix.

Well, even when the grammar is forced to accept the whole input, the unit test
case

"civico n° 4"

(which is not an identifier, given the ID regular expression) leads the system to
the following diagnostic

...gpl.ParseException: Encountered "n" at line 1, column 8.
Was expecting:

It would seem that the generated parser skips the white space and then
finds another (correctly generated but unexpected) token, "n".
So forcing EOF does not seem to force the system to use the ID/Id
lexical rule and to report the space as an unexpected character at
the lexical level.
Even though I get a diagnostic that reports a problem, it is not formulated
in terms of the ID lexical rule.
Another symptom, I guess, is that the "Was expecting" section still does not
report anything.

I guess that I should work at the lexical level.
But how can I directly exercise a regular expression (ID in this case) from the
client (unit testing) code, instead of calling a nonterminal as
currently done via the Id() method?

AC

Aug 24, 2007, 7:44:08 AM

Here's a sample showing another approach to writing unit tests for
JavaCC tokens and productions.

As you suggested, for testing a token the test code gets tokens
directly rather than calling a production. In general the token
manager does not use context to choose which token to create, so the
test code gets whatever next token it produces and then checks that it
was the correct kind. Since you were concerned with skipped space
characters, for tokens the test code also checks that token starts at
the beginning of the test input and ends at the end of the test input.

For testing a production, it gets the start token, calls the
production, gets the end token, and collects a list of all the tokens
between them parsed by the production. Then it compares this list to
the expected list of token strings. For testing a production, skipped
characters like spaces are not significant except to separate tokens.
If the production does not end with <EOF>, the result may be a prefix
of the input, but that is ok since it compares the list of tokens
parsed.
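
The comparison step of such a production test can be illustrated in plain Java. This is a simplified sketch, not the code from the attached samples: the class TokenListCheck and its methods are hypothetical, a regular expression stands in for the token manager, and Character.isWhitespace plays the role of the SKIP production.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the attached sample grammars.
class TokenListCheck {
    // ID or NUM, shaped like the tokens of the grammar under discussion.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z_0-9]*|[0-9]+");

    // Collects the images of all tokens in the input, ignoring skipped characters.
    static List<String> images(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (true) {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos == input.length()) {
                return result;
            }
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("lexical error at column " + (pos + 1));
            }
            result.add(m.group());
            pos = m.end();
        }
    }

    // A production-style check passes when the token images equal the expected ones.
    static boolean matches(String input, String... expected) {
        return images(input).equals(Arrays.asList(expected));
    }

    public static void main(String[] args) {
        System.out.println(matches("foo2 boo2", "foo2", "boo2"));
        System.out.println(matches("foo2 \t boo2", "foo2", "boo2"));
        System.out.println(matches("foo2 boo2", "foo2boo2"));
    }
}
```

Because only the token images are compared, the amount of white space between "foo2" and "boo2" makes no difference, which is the behaviour described for production tests.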

The sample includes example tests but not complete coverage of the
grammar. The 'demo' sample tests show what failure messages look
like. Both token ID and production Id are shown so you can compare
some differences between testing a token and testing a production that
just calls the token, in particular the treatment of spaces and
legal prefixes.

Two versions are included:
- SampleInstanceGrammar shows the approach for nonstatic grammars,
- SampleStaticGrammar shows the approach for static grammars

Ideas for possible additional features not implemented here:
- if grammar uses token states, add token state parameter to token tests
- if productions return values, add return value to production tests
- for concurrent testing of an instance parser, say for a large number of
test cases or for large test cases, create a separate parser for each test rather
than reusing one parser, and run the tests in separate tasks using a thread pool.
Or you could create a parser for each group of tests run sequentially.
(Some unit test frameworks enable you to do something like this if you
use one instead of the runAllTests method.)

Hope this helps!

SampleInstanceGrammarWithUnitTests.jj

SampleStaticGrammarWithUnitTests.jj

AC

Aug 24, 2007, 7:51:13 AM

Or create a parser for each group of tests run sequentially.

SampleInstanceGrammarWithUnitTests.jj

SampleStaticGrammarWithUnitTests.jj

Cesare Zecca

Aug 28, 2007, 11:03:43 AM

On 24 Aug, 13:51, AC <u...@domain.invalid> wrote:

> As you suggested, for testing a token the test code gets tokens
> directly rather than calling a production. In general the token
> manager does not use context to choose which token to create, so the
> test code gets whatever next token it produces and then checks that it
> was the correct kind. Since you were concerned with skipped space
> characters, for tokens the test code also checks that token starts at
> the beginning of the test input and ends at the end of the test input.

That works fine.
I rearranged the

void runTokenTestCase( int tokenKind, String input, String expect )

method into the GplProxy's method

Throwable applyTokenRule( int tokenKind, String input )

The expect argument has been removed, and applyTokenRule() returns the
diagnostic as its result (instead of throwing it).
Basically, any client code should be able to run applyTokenRule() (e.g.
methods that check database integrity), not only unit tests. That
is, we only seldom know the expected result a priori. Most of the time
the job is:

I have such a string: does it comply with the tokenKind (lexical)
rule?
If it does not, please return the diagnostic.

The functional approach allows those methods to be used in boolean
expressions (e.g. in contract clauses).
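
A rough illustration of this functional style, assuming nothing about the real GplProxy: in the sketch below a java.util.regex.Pattern stands in for the token manager, and the class TokenRuleCheck with its methods applyTokenRule() and isId() is hypothetical. The check returns null on success and a diagnostic string otherwise.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a "functional" token check; a regex stands in for the token manager.
class TokenRuleCheck {
    static final Pattern ID = Pattern.compile("[A-Za-z_][A-Za-z_0-9]*");

    // Returns null when the whole input is one token of the given shape,
    // otherwise a diagnostic naming the first offending column (1-based).
    static String applyTokenRule(Pattern tokenShape, String input) {
        Matcher m = tokenShape.matcher(input);
        if (!m.lookingAt()) {
            return "no token at column 1";
        }
        if (m.end() < input.length()) {
            return "unexpected character '" + input.charAt(m.end())
                    + "' at column " + (m.end() + 1);
        }
        return null;
    }

    // Nothing is thrown, so the check fits directly into a boolean expression.
    static boolean isId(String input) {
        return applyTokenRule(ID, input) == null;
    }

    public static void main(String[] args) {
        System.out.println(isId("myName_1_isOk"));
        System.out.println(applyTokenRule(ID, "foo2 boo2"));
        System.out.println(applyTokenRule(ID, "#foo"));
    }
}
```

A caller that only needs a yes/no answer uses isId(); a caller that wants to show the reason keeps the returned string.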

[...]

It greatly helped.
For the time being I am working on this; finer things such as
optimizations and testing productions will be tackled soon.
Thanks, AC!

Cesare Zecca

Sep 3, 2007, 11:17:19 AM

Let me continue this discussion here, since the subject "Unit tests for
JavaCC tokens and productions" suggests that this is the most suitable place.

On 1 Sep, 14:06, AC <u...@domain.invalid> wrote:
http://groups.google.com/group/comp.compilers.tools.javacc/browse_thread/thread/10f80413ad4af3c7/4524e644fe3885e0#4524e644fe3885e0

> > ID and GROUP_ID share a common prefix and cause a lookahead problem.
>
> Actually, the problem is that there are two ways to parse an <ID>,
> either
> Factor() -> Id() -> <ID>
> or
> Factor() -> GroupId() -> <ID>
> The parser doesn't know which one to choose.
>
> For the example grammar, this is unnecessary, so the fix is to
> eliminate the redundancy.

>
> Approach 1: Eliminate Id() and GroupId() [...] Then <ID> would be parsed via
> Factor() -> <ID>
>
> Approach 2: Remove <ID> from GroupId(), [...] Then <ID> would be parsed via
> Factor() -> Id() -> <ID>

Well, the issue is related to the unit testing (of legacy code) and to the
reuse of the grammar for other checks; it should be possible to call
either Id() or GroupId() distinctly and independently (the former to
check, e.g., the integrity of names (identifiers) in the db, the latter
to check the integrity of group names (group identifiers)).

(By the way:
after

void runTokenTestCase( int tokenKind, String input, String expect )

I have defined, modeled on void runProductionTestCase(...)
(see above),

public
GplProductionException
runProductionRule
( String pMethodName
, String pInput
, List<String> pExpectedTokenImages
)

to test productions as well, and it works fine with group
identifiers!)

Well, if I have guessed correctly, when a number of tokens are involved it is
better to check productions, to avoid the problems concerning skipping, and
this is the case for group ids (currently I
do not remember exactly all the reasons why, to check group
identifiers, I had to use runProductionTestCase(), alias
runProductionRule(), instead of the corresponding token-oriented
methods).

Anyway, given my lack of expertise, I am not yet able to provide a rule
of thumb that lets a less-than-expert user (or,
better, a programmer who does not know the JavaCC details) decide, from the
outside, whether to call the production-based or the token-based check methods.
In some way I would be glad to hide this complexity from the unit
test programmer, so that the client (code/programmer) need not check both
Id() and GroupId() when the purpose is simply to check group
identifiers.

Let me quote here a couple of sentences which gave me a hint
(cf. lookahead tutorial https://javacc.dev.java.net/doc/lookahead.html)

A design decision must be made to determine if Option 1 (left
factoring) or Option 2 (lookahead specification) is the right one to
take (...) However, the advantage of choosing Option 2 is that you
have a simpler grammar - one that is easier to develop and maintain -
one that focuses on human-friendliness and not machine-friendliness.
(...)
Sometimes Option 2 is the only choice - especially in the presence of
user actions.

Should we add "or if you need to call one or more specific
productions individually and are therefore prevented from factoring them"?
Do I guess correctly?

ciao
Cesare

P.S.
Surely there is some lack of skill, and there are some process-driven,
constrained decisions, in my method as well. First I must be sure
that I can replace the calls to the handcrafted parser in the unit tests
with calls to the JavaCC machinery without introducing any
regression.
Afterwards I would like to expand the grammar with the
constructs that are still missing. This is why I reduced the
grammar to Id()s and GroupId()s first of all. Maybe it's not the best
approach.

AC

Sep 4, 2007, 7:59:42 AM

Here are some more approaches to consider.

It sounds like you want to keep Id() and GroupId().
In the example given, Factor does not depend on the choice of Id() or
GroupId(), so since GroupId() covers Id(), Id() could be removed from
Factor().

Approach 3a: Remove Id() from Factor. Then <ID> would be parsed via
Factor() -> GroupId() -> <ID>

Approach 3b: Remove Id() from Factor and replace <ID> with Id() in
GroupId(). Then <ID> would be parsed via
Factor() -> GroupId() -> Id() -> <ID>

You mentioned that you had to use a production test for group id.
Was it because GroupId was returning an <ID> rather than a <GROUP_ID>
for some test cases, so the token kind did not match?

Approach 4: extend the token test cases to permit multiple expected
token kinds. Maybe something like the following (untested):

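void runTokenTestCase(int[] tokenKinds, String input, String expect) {
  sampleInstanceParser.ReInit(new StringReader(input));
  String result = null;
  Throwable error = null;
  try {
    Token token = sampleInstanceParser.getToken(1); /* peek */
    int[] sortedKinds = tokenKinds.clone(); // binarySearch needs a sorted array
    Arrays.sort(sortedKinds);
    if (Arrays.binarySearch(sortedKinds, token.kind) < 0) {
      // the generated constructor takes int[][]: one one-token sequence per accepted kind
      int[][] expected = new int[sortedKinds.length][];
      for (int i = 0; i < sortedKinds.length; i++) {
        expected[i] = new int[] { sortedKinds[i] };
      }
      throw new ParseException(sampleInstanceParser.getToken(0),
          expected, SampleInstanceParserConstants.tokenImage);
    }
    result = token.image;
    if (token.beginColumn != 1) // column is 1-based
      throw new TestInputIsNotATokenException(result, 1, input.charAt(0));
    if (token.endColumn < input.length()) // column is 1-based
      throw new TestInputIsNotATokenException(result, token.endColumn + 1,
          input.charAt(token.endColumn));
  } catch (Throwable t) {
    error = t;
    result = t.getClass().getName();
  }
  // label the report with the names of all accepted token kinds
  StringBuilder kindNames = new StringBuilder();
  for (int kind : tokenKinds) {
    if (kindNames.length() > 0) kindNames.append(" | ");
    kindNames.append(SampleInstanceParserConstants.tokenImage[kind]);
  }
  reportResult(kindNames.toString(),
      input, Collections.<String>singletonList(expect),
      Collections.<String>singletonList(result), error);
}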

You seemed to be worried that the unit test writer cannot easily tell
that a GroupId could be an <ID> or a <GROUP_ID>, so calling the token
checker with just kind GROUP_ID can fail.

Approach 5: eliminate <ID> and use <GROUP_ID> for both Id() and
GroupId(). Add a check to Id() to make sure it doesn't contain any
group separator characters.

TOKEN : // Group identifiers
{ <GROUP_ID : <ID> ( <GROUP_ID_SEPARATOR> <ID> )* >
| <#ID : ["a"-"z", "A"-"Z","_"] ( ["a"-"z", "A"-"Z", "_", "0"-"9"] )* >
| <#GROUP_ID_SEPARATOR: "." >
}

// ...

String
Id() : {
  Token lToken;
  String lResult;
} {
  lToken = <GROUP_ID>
  {
    lResult = lToken.image;
    // an Id is a GROUP_ID without any group separator
    if (lResult.indexOf('.') != -1) {
      throw new ParseException("'.' in Id " + lResult);
    }
    return lResult;
  }
} // end Gpl.Id()

String
GroupId() : {
Token lToken;
String lResult;
} {
lToken = <GROUP_ID>

{
lResult = lToken.image;
return lResult;
}
}

This means a single token type, GROUP_ID, would be used for both Id
and GroupId, so the token checker would not be able to detect an error
when a GroupId is used for an Id, only the production checker could do
that.

Approach 6: use token manager states. The tokenizer does not use
context: it takes the longest match and, among matches of equal length,
the token kind declared first, so in the
grammar where <ID> and <GROUP_ID> both match an undotted name, it
returns <ID>. (Lookahead only works for production ambiguities, not
token ambiguities.) If you want the same string to be an <ID> in some
cases and a <GROUP_ID> in other cases, one approach is to use token
manager states, though that can make the grammar much more
complicated, depending on how easy it is to specify each context.
There must be some preceding token (or one of several) that
always marks the start of the context, and either the id itself or some
token afterward that marks the end of that context. This approach
would also require adding another parameter to each test case to
specify which token state to use.
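
The context-free choice of token kind can be modelled in a few lines of plain Java. This is only a model, not how the generated token manager is implemented: the class KindChooser and its method kindOf() are hypothetical, regular expressions stand in for the token definitions, and <ID> is assumed to be declared before <GROUP_ID>.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of a tokenizer that chooses a token kind without using context.
class KindChooser {
    // Declaration order matters, so an insertion-ordered map is used.
    static final Map<String, Pattern> KINDS = new LinkedHashMap<>();
    static {
        KINDS.put("ID", Pattern.compile("[A-Za-z_][A-Za-z_0-9]*"));
        KINDS.put("GROUP_ID",
                Pattern.compile("[A-Za-z_][A-Za-z_0-9]*(\\.[A-Za-z_][A-Za-z_0-9]*)*"));
    }

    // Longest match wins; on a tie the kind declared first wins.
    static String kindOf(String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> entry : KINDS.entrySet()) {
            Matcher m = entry.getValue().matcher(input);
            // the strict '>' keeps the earlier kind when lengths are equal
            if (m.lookingAt() && m.end() > bestLength) {
                best = entry.getKey();
                bestLength = m.end();
            }
        }
        return best; // null when no kind matches at the start of the input
    }

    public static void main(String[] args) {
        System.out.println(kindOf("foo"));
        System.out.println(kindOf("foo.bar"));
    }
}
```

In this model an undotted name always comes out as ID, whatever the caller hoped for, which is why a token test that expects only GROUP_ID fails on it.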

I hope one of these helps...

Cesare Zecca

Sep 4, 2007, 11:24:38 AM

Thank you, AC.
In some way I hope that my inexperienced participation in this group
may be useful.

Basically we have two syntactical (for the time being I write
neither lexical nor grammatical; let's abstract for a while)
categories

and

<GROUP_ID : <ID> ( "." <ID> )* >

GROUP_ID is a superset of ID.

In other words, if we denote by GROUP_ID' the following subset of
GROUP_ID

<GROUP_ID' : <ID> "." <ID> ( "." <ID> )* >

we can write (U stands for set union)

GROUP_ID = ID U GROUP_ID'

I know that perhaps my reasoning is naive or does not comply with
the JavaCC specifications.

As a unit tester, with a black-box approach, I would like to do
"something" like this: can I check whether a given String belongs to the
ID set or to the GROUP_ID set?
In other words, I would like not to go into the details, and to avoid
reformulating the question in terms of ID and (ID U GROUP_ID').
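
Seen purely as sets of strings, the three categories can be checked with ordinary regular expressions. The sketch below is an illustration outside JavaCC: the class NameSets and its methods are hypothetical, and the patterns are transcribed by hand from the ID and GROUP_ID definitions used in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical membership checks; regular expressions stand in for the token definitions.
class NameSets {
    static final String ID_RE = "[A-Za-z_][A-Za-z_0-9]*";
    static final Pattern ID = Pattern.compile(ID_RE);
    static final Pattern GROUP_ID = Pattern.compile(ID_RE + "(\\." + ID_RE + ")*");
    // GROUP_ID': like GROUP_ID, but with at least one "." separator.
    static final Pattern GROUP_ID_PRIME = Pattern.compile(ID_RE + "(\\." + ID_RE + ")+");

    static boolean isId(String s) { return ID.matcher(s).matches(); }
    static boolean isGroupId(String s) { return GROUP_ID.matcher(s).matches(); }
    static boolean isGroupIdPrime(String s) { return GROUP_ID_PRIME.matcher(s).matches(); }

    public static void main(String[] args) {
        String[] samples = { "foo", "foo.bar", "a.b.c", "foo.", ".foo", "foo bar", "" };
        for (String s : samples) {
            // GROUP_ID = ID U GROUP_ID' should hold for every sample.
            boolean union = isId(s) || isGroupIdPrime(s);
            System.out.println("\"" + s + "\" id=" + isId(s)
                    + " groupId=" + isGroupId(s)
                    + " unionAgrees=" + (union == isGroupId(s)));
        }
    }
}
```

A black-box client would call only isId() or isGroupId(); the GROUP_ID' set is needed only to state the relation between the two.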

Reading Norvell's FAQ document I understood that it is generally
possible for a character sequence to match more than one token kind.
The FAQ document also supplies a solution pattern
(http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-moz.htm#tth_sEc3.6):
Token b() : {Token t ; }{ (t= < A > | t= < B > ) {return t;} }

Actually
b stands for Factor()
A stands for ID
B stands for GROUP_ID

I went a step further: I replaced the explicit references to the token kinds
with calls to the nonterminals Id() and GroupId().
Perhaps this is not correct.

> You mentioned that you had to use a production test for group id.
> Was it because GroupID was returning an <ID> rather than a <GROUP_ID>
> for some test cases, so the token kind did not match?

Exactly, you are right, AC.
As you suggested, I wrote an extended form of runTokenTestCase() able
to handle a Set of token kinds, but that would have put the
burden of details on the shoulders of the unit tester (perhaps she/he
would not have the knowledge to go deeply into ID, GROUP_ID and
GROUP_ID' and such subtleties).

Now I will try approach no. 5 above: only GROUP_ID, along with a
nonterminal Id() that narrows the legal input to the ID proper subset
by excluding any token that contains a ".". I think it may work fine,
allowing independent calls to either Id() or GroupId() while hiding
the details from the clients.

I am working to understand the "whole" philosophy and its rules of
thumb; sorry for the naive questions.
Thanks for supporting me.