Is this context sensitive?

Michael Burrows

Sep 25, 2002, 6:23:24 PM

Hi,
I'm writing a JavaScript grammar. It has to distinguish between
arithmetic expressions such as
a = b/c/d;
and regular expression literals such as
a = /c/;

I have tried to resolve the ambiguity between regular expression
tokenization and arithmetic tokenization by specifying a separate
lexical state for the regular expression body, i.e.:

<ExpectingRegExp>
TOKEN:
{
  < RegularExpressionPattern: ~["/", "*"] ("\\/" | ~["/"])* > : DEFAULT
}

void Literal() :
{}
{
    <NullLiteral>
  | <BooleanLiteral>
  | <DecimalLiteral>
  | <OctalLiteral>
  | <HexLiteral>
  | <FloatLiteral>
  | <StringLiteral>
  | "/" { token_source.SwitchTo(ParserConstants.ExpectingRegExp); }
    <RegularExpressionPattern>
    { token_source.SwitchTo(ParserConstants.DEFAULT); } "/"
}

This doesn't seem to work. I guess it's defeated by lookahead.

Any ideas?

Thanks
Mike

Eric Nickell

Sep 26, 2002, 12:59:22 PM

On Wed, 25 Sep 2002 15:23:24 -0700, Michael Burrows wrote:
> This doesn't seem to work. I guess it's defeated by lookahead.
>
> Any ideas?

I have tried to avoid changing lexer states from the parser code, since
the lexer may be several tokens ahead of the parser.

Given your situation, here are some initial thoughts:

(1) Create a single token for a regular expression, including the initial
and final "/". This removes the confusion between "/" as an arithmetic
operator and a regular expression delimiter. You will have to deal with
the fact that your token includes the delimiters.

(2) Alternatively, see if you can determine which of your tokens may be
followed by a regular expression, and which cannot. (This is a more
difficult task, and more error prone. And some grammars may not be
amenable to this.) So rather than 2 states (DEFAULT and
EXPECTING_REG_EXP), you would have at least 3: DEFAULT (where the *lexer*
knows that a regular expression is illegal),
SLASH_MEANS_REGULAR_EXPRESSION, and EXPECTING_REG_EXP. You must think
through each token in your grammar and decide whether you should be in
DEFAULT or SLASH_MEANS_REGULAR_EXPRESSION afterward. (1) is easier.
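
To make (2) concrete, here is a rough hand-written sketch in plain Java rather than JavaCC, so it can be run on its own. It is only an illustration under simplifying assumptions, not a full JavaScript lexer: the class name, the token labels, and the rule "after an operand, ')' or ']' a slash divides; after anything else it opens a regexp" are all made up for this example. Keywords such as return are lumped in with operands, and the regexp body is scanned inline instead of in a separate EXPECTING_REG_EXP state.

```java
import java.util.ArrayList;
import java.util.List;

class SlashStateLexer {
    // Hypothetical state names, loosely following the suggestion above.
    enum State { DEFAULT, SLASH_MEANS_REGULAR_EXPRESSION }

    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        // At the start of input an operand is expected, so "/" would open a regexp.
        State state = State.SLASH_MEANS_REGULAR_EXPRESSION;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (Character.isLetterOrDigit(c) || c == '_') {
                // Identifier or number: a value just ended, so a following "/" divides.
                while (i < src.length()
                        && (Character.isLetterOrDigit(src.charAt(i)) || src.charAt(i) == '_')) {
                    i++;
                }
                out.add("OPERAND(" + src.substring(start, i) + ")");
                state = State.DEFAULT;
            } else if (c == '/' && state == State.SLASH_MEANS_REGULAR_EXPRESSION) {
                i++; // opening slash
                while (i < src.length() && src.charAt(i) != '/') {
                    if (src.charAt(i) == '\\' && i + 1 < src.length()) i++; // skip the backslash
                    i++;
                }
                if (i >= src.length()) {
                    throw new IllegalArgumentException("unterminated regexp at " + start);
                }
                i++; // closing slash
                out.add("REGEXP(" + src.substring(start, i) + ")");
                state = State.DEFAULT;
            } else {
                i++;
                out.add("PUNCT(" + c + ")");
                // ")" and "]" end a value; any other punctuation expects an operand next.
                state = (c == ')' || c == ']')
                        ? State.DEFAULT
                        : State.SLASH_MEANS_REGULAR_EXPRESSION;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a = b/c/d;"));
        System.out.println(tokenize("a = /c/;"));
    }
}
```

In the first input both slashes follow an operand and come out as PUNCT(/); in the second the slash follows "=" and the whole /c/ comes out as one REGEXP token. The point is that the decision is made entirely inside the lexer from the previous token, so parser lookahead no longer matters.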

hth
Eric

Michael Burrows

Sep 27, 2002, 2:03:44 PM

Eric Nickell <no-nick...@parc.xerox.com> wrote in message news:<amveda$ndp$1...@news.parc.xerox.com>...

> On Wed, 25 Sep 2002 15:23:24 -0700, Michael Burrows wrote:
> > This doesn't seem to work. I guess it's defeated by lookahead.
> >
> > Any ideas?
>

> Given your situation, here are some initial thoughts:
>
> (1) Create a single token for a regular expression, including the initial
> and final "/". This removes the confusion between "/" as an arithmetic
> operator and a regular expression delimiter. You will have to deal with
> the fact that your token includes the delimiters.

I think (1) means that many expressions that are supposed to be
arithmetic would be interpreted as regexps instead.
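
A small sketch of why, in plain Java so it runs standalone. JavaCC prefers the longest match among the tokens of the current lexical state, and the code below imitates just that rule with java.util.regex. The two patterns and the class name are assumptions made up for this illustration, not taken from the actual grammar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Hypothetical single-token regexp literal: "/", a body with optional escapes, "/".
    static final Pattern REGEXP_LITERAL = Pattern.compile("/(?:\\\\.|[^/\\\\\\n])+/");
    // The plain division operator.
    static final Pattern DIVIDE = Pattern.compile("/");

    // Returns the longest text that either pattern matches starting exactly at pos.
    static String longestAt(String src, int pos) {
        String best = "";
        for (Pattern p : new Pattern[] { DIVIDE, REGEXP_LITERAL }) {
            Matcher m = p.matcher(src).region(pos, src.length());
            if (m.lookingAt() && m.group().length() > best.length()) {
                best = m.group();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String src = "a = b/c/d;";
        System.out.println(longestAt(src, src.indexOf('/')));
    }
}
```

At the first slash of a = b/c/d; both patterns match, but /c/ is longer than /, so the regexp token wins even though the slash was meant as division.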

>
> (2) Alternatively, see if you can determine which of your tokens may be
> followed by a regular expression, and which cannot. (This is a more
> difficult task, and more error prone. And some grammars may not be
> amenable to this.) So rather than 2 states (DEFAULT and
> EXPECTING_REG_EXP), you would have at least 3: DEFAULT (where the *lexer*
> knows that a regular expression is illegal),
> SLASH_MEANS_REGULAR_EXPRESSION, and EXPECTING_REG_EXP. You must think
> through each token in your grammar and decide whether you should be in
> DEFAULT or SLASH_MEANS_REGULAR_EXPRESSION afterward. (1) is easier.
>

(2) seems to solve the problem!

Many thanks,
Mike