JavaCC: two tokens have the same definition?

Link

Apr 23, 2008, 9:52:49 AM

Hi everyone!

I use JTB/JavaCC to develop a compiler for my language.

I have a trivial JavaCC problem which is blocking me from otherwise
having a JTB/JavaCC solution.

In the grammar specification rules, I have two tokens which have the
same definition.
...
TOKEN:{
<#IDENT: ["A"-"Z"](["a"-"z","A"-"Z","0"-"9","-","_"])*>
| <IdentTypeA: <IDENT> >
| <IdentTypeB: <IDENT> >
}
...

In spite of their having the same definition, I would like to keep
these two tokens.

JTB and JavaCC read the grammar and generate files without problems.
IdentTypeA is recognized correctly, but IdentTypeB never is.

This problem is some kind of "first declared, first defined".
(Yes, it's true that the rule that appears first in the specification
file has priority.)

Do you have any idea about this problem?
Is there a way to get around it?

Thanks

Best regards
Link

Cesare Zecca

Apr 24, 2008, 8:08:02 AM

On Apr 23, 3:52 pm, Link <leon.lim.i...@gmail.com> wrote:
> Hi everyone!
>
> I use JTB/JavaCC to develop a compiler for my language.
>
> I have a trivial JavaCC problem which is blocking me from otherwise
> having a JTB/JavaCC solution.
>
> In the grammar specification rules, I have two tokens which have the
> same definition.

[...]

I'm no JavaCC expert, but I know that the token manager,
given the current rules, will always return <IdentTypeA>.
The answer may lie in the reason for your request.
Why do you need two names for the same token category?

Link

Apr 24, 2008, 2:41:12 PM

Thank you for your response
(and sorry for my bad English ^^)

[...]

>
> > In the grammar specification rules, I have two tokens which have the
> > same definition.
>
> [...]
>
> I'm no JavaCC expert, but I know that the token manager,
> given the current rules, will always return <IdentTypeA>.
> The answer may lie in the reason for your request.

Yes, of course.

> Why do you need two names for the same token category?

In my declarative language I would like to distinguish an <IdentTypeA>
from an <IdentTypeB>.
Here are some examples of code in my language:

elementA AnIdentTypeA = < body > ;
elementB AnIdentTypeB = < body > ;

This way, the identifier "AnIdentTypeA" will be considered an
identifier of type A, and likewise "AnIdentTypeB" an identifier of
type B...

In my grammar specification, all expressions are typed, even the
terminal symbols. This is useful in the semantic analysis phase, for
verification etc.

Of course, I could simply use IDENT instead of IdentTypeX, but
it is preferable for my language to have a type for each identifier.

So if the token manager will always return <IdentTypeA>, I have no
choice but to use IDENT :( ^^

I would appreciate your reply.

Thanks again
++

Chris F Clark

Apr 26, 2008, 12:12:37 AM

Link <leon.l...@gmail.com> writes:

> In the grammar specification rules, I have two tokens which have the
> same definition.
>
> In my declarative language I would like to distinguish an <IdentTypeA>
> from an <IdentTypeB>.

Your desire is not an uncommon one, and there is at least a partial
solution. However, first let me explain why you can't get exactly
what you want. When the lexer recognizes that a set of characters
matches one of the tokens, it returns the token matched. The lexer
does this without consulting the context or any other information.
The complex technical reasons essentially reduce to that doing it that
way makes it easy to prove correct and the resulting lexer runs
fast. (If you want an explanation of the why's of that, just ask.)
Given that the lexer doesn't look at anything but the characters that
make up the token, if you have two tokens that have the same spelling,
the lexer has no way to tell which token type to return.
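
To make the tie-breaking concrete, here is a minimal Java sketch of a
context-free token manager. It is a toy model, not JavaCC's generated
code: the class FirstMatchLexer and the method kindOf are hypothetical
names, and java.util.regex stands in for the generated automaton. It
assumes the matching rule JavaCC documents: the longest match wins, and
a tie goes to the rule declared first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of a token manager that sees characters only, never context. */
final class FirstMatchLexer {
    // LinkedHashMap keeps declaration order, like the rules of a TOKEN block.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        Pattern ident = Pattern.compile("[A-Z][a-zA-Z0-9_-]*");
        RULES.put("IdentTypeA", ident);
        RULES.put("IdentTypeB", ident); // same definition as IdentTypeA
    }

    /** Returns the kind matched at the start of input, or null if none matches. */
    static String kindOf(String input) {
        String bestKind = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strictly longer only: an equal-length match from a later rule loses.
            if (m.lookingAt() && m.end() > bestLength) {
                bestKind = rule.getKey();
                bestLength = m.end();
            }
        }
        return bestKind;
    }

    public static void main(String[] args) {
        System.out.println(kindOf("AnIdentTypeA")); // IdentTypeA
        System.out.println(kindOf("AnIdentTypeB")); // IdentTypeA as well
    }
}
```

No input can ever select IdentTypeB in this model: whatever it matches,
the earlier rule matches at the same length.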

If you want to introduce context into the token type, you simply turn
the token into a non-terminal. Declare an IDENT token and identTypeA
and identTypeB non-terminals, each of which simply reduces to an IDENT
token. The parser has context where the lexer doesn't. (In fact,
that is the essential difference between a lexer and a parser--the
parser is the technology that understands context.)
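
The non-terminal approach can be sketched as a small hand-written
recursive-descent parser in Java. This is only an illustration under
assumptions: DeclParser, parseAll and the whitespace-split tokens are
hypothetical, and only the keyword and the identifier of each
declaration are parsed, not the "= < body > ;" part. In a JavaCC grammar
the two non-terminals would instead be productions whose body is just
<IDENT>, with IDENT declared without the '#' prefix so that productions
can refer to it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy recursive-descent parser: a single IDENT token, two non-terminals wrapping it. */
final class DeclParser {
    private final String[] tokens; // whitespace-separated, to keep the sketch short
    private int pos = 0;

    private DeclParser(String source) {
        this.tokens = source.trim().split("\\s+");
    }

    /** Parses a sequence of "elementA Name" / "elementB Name" pairs. */
    static List<String> parseAll(String source) {
        DeclParser parser = new DeclParser(source);
        List<String> result = new ArrayList<>();
        while (parser.pos < parser.tokens.length) {
            result.add(parser.declaration());
        }
        return result;
    }

    // declaration := "elementA" identTypeA | "elementB" identTypeB
    private String declaration() {
        String keyword = next();
        switch (keyword) {
            case "elementA": return "A:" + identTypeA();
            case "elementB": return "B:" + identTypeB();
            default: throw new IllegalStateException("unexpected keyword: " + keyword);
        }
    }

    // Both non-terminals reduce to the same token; the parser's position
    // in the grammar is what tells them apart.
    private String identTypeA() { return ident(); }
    private String identTypeB() { return ident(); }

    // The only lexical rule for identifiers.
    private String ident() {
        String token = next();
        if (!token.matches("[A-Z][a-zA-Z0-9_-]*")) {
            throw new IllegalStateException("expected IDENT, got: " + token);
        }
        return token;
    }

    private String next() {
        if (pos >= tokens.length) {
            throw new IllegalStateException("unexpected end of input");
        }
        return tokens[pos++];
    }

    public static void main(String[] args) {
        System.out.println(parseAll("elementA Foo elementB Foo")); // [A:Foo, B:Foo]
    }
}
```

The same spelling "Foo" comes out as type A after elementA and as type B
after elementB, which is exactly the context the lexer alone cannot see.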

However, in your case, you probably want all uses of a token with the
same spelling to get the same "token" type. In that case, what you
want is to have the type kept in the symbol table, in essence a
"symbol" type. The symbol table is the place to keep information
about all tokens representing the same entity. In the rules which
turn IDENT tokens into identTypeA or identTypeB non-terminals, you
query the symbol table and check that the type you want to use matches
the type you have recorded in the symbol table and if necessary you
record the information you want saved for later uses.
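
The symbol-table check can be sketched in Java as follows. This is a
minimal sketch under assumptions, not part of JavaCC or JTB:
SymbolTable, Kind, declare and isKind are hypothetical names, and a real
compiler would also have to deal with scopes. The idea is that the rules
for declarations call declare, and the rules that use an identifier call
isKind.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy symbol table: remembers which kind each identifier was declared as. */
final class SymbolTable {
    enum Kind { TYPE_A, TYPE_B }

    private final Map<String, Kind> kinds = new HashMap<>();

    /** Records a declaration; rejects a redeclaration under a different kind. */
    void declare(String name, Kind kind) {
        Kind previous = kinds.putIfAbsent(name, kind);
        if (previous != null && previous != kind) {
            throw new IllegalStateException(name + " already declared as " + previous);
        }
    }

    /** Checks a use site: true only if the name is known and has the expected kind. */
    boolean isKind(String name, Kind expected) {
        return kinds.get(name) == expected;
    }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        table.declare("AnIdentTypeA", Kind.TYPE_A);
        System.out.println(table.isKind("AnIdentTypeA", Kind.TYPE_A)); // true
        System.out.println(table.isKind("AnIdentTypeA", Kind.TYPE_B)); // false
    }
}
```

With this in place every use of the same spelling gets the same kind,
no matter where in the input it appears.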

Hope this helps,
-Chris

******************************************************************************
Chris Clark                Internet: christoph...@compiler-resources.com
Compiler Resources, Inc.   or:       com...@world.std.com
23 Bailey Rd               Web Site: http://world.std.com/~compres
Berlin, MA 01503           voice:    (508) 435-5016
USA                        fax:      (978) 838-0263 (24 hours)
------------------------------------------------------------------------------

Link

Apr 28, 2008, 7:36:21 AM

Firstly, I would like to thank you very much for all your explanations.
They really helped me to understand things clearly.
And I'm sorry for replying so late.

On Apr 26, 06:12, Chris F Clark <c...@shell01.TheWorld.com> wrote:
> Your desire is not an uncommon one, and there is at least a partial
> solution. However, first let me explain why you can't get exactly
> what you want. When the lexer recognizes that a set of characters
> matches one of the tokens, it returns the token matched. The lexer
> does this without consulting the context or any other information.

[...]

Yes, I know that the lexer does "this" without consulting the context
or any other information. What I was not sure about was whether the
lexer, on reading "AnIdentTypeB" (leaving context aside), could fail to
interpret it as an "IDENT". In the end, you're right: if the lexer
interprets "AnIdentTypeB" as an "IDENT", why not simply use
"IDENT"...

I would appreciate your explanation of "The complex technical
reasons essentially reduce to that doing it that way makes it easy
to prove correct and the resulting lexer runs fast". Please give me
that explanation (a short one), but only if you have time :)

> If you want to introduce context into the token type, you simply turn
> the token into a non-terminal. Declare an IDENT token and identTypeA
> and identTypeB non-terminals, each of which simply reduces to an IDENT
> token. [...]

It seems to work well for my problem. I have tested it with a simple
example and it worked. I will integrate this solution into my grammar
specification rules.

> However, in your case, you probably want all uses of a token with the
> same spelling to get the same "token" type. In that case, what you
> want is to have the type kept in the symbol table, in essence a
> "symbol" type. The symbol table is the place to keep information
> about all tokens representing the same entity. In the rules which
> turn IDENT tokens into identTypeA or identTypeB non-terminals, you
> query the symbol table and check that the type you want to use matches
> the type you have recorded in the symbol table and if necessary you
> record the information you want saved for later uses.

Yes, I will use the symbol table. For now, I'm not at that phase
yet.

Thanks
Link