Problems parsing quoted/unquoted strings.


Norman Dunbar

Oct 29, 2015, 12:12:49 PM
to antlr-di...@googlegroups.com
Afternoon All,

I've got a project in the wild and I need to update one particular
parser rule, which requires an ID, an '=' and then a string of text for
the value.

The text can be double quoted, single quoted or unquoted. With quotes,
there's no problem. Without them, I can't get the lexer to recognise the
unquoted text; in fact it messes up almost every other token type and,
depending on the UQ_STRING rule's definition, seems to set them all as
unquoted strings, which is not helpful.

I've extracted the main problem to a minimal grammar, as follows. In
this case it's not setting the wrong token types; only the "big"
grammar does that.

cat test.g4
//-----------------------------------------------------------
grammar test ;

test : (stat)* EOF ;
stat : ID '=' (DQ_STRING | SQ_STRING | UQ_STRING) ;


WS : [ \t\r\n]+ -> skip ;
ID : [A-Za-z][A-Za-z0-9_\-]* ;
COMMENT : HASH (.)*? NL -> skip ;
DQ_STRING : DQ (~'"')*? DQ ;
SQ_STRING : SQ (~'\'')*? SQ ;

// This is last, surely anything above will be picked first? Like ID?
UQ_STRING : ~['"''\'']*? NL ;


fragment DQ : '"' ;
fragment SQ : '\'' ;
fragment NL : '\n' ;
fragment HASH : '#' ;
//-----------------------------------------------------------

I've seen a previous thread on a similar matter, but unfortunately, it
didn't help.

The following is my test file:

cat test.txt
#-----------------------------------------------------------
# This is a test file for my test grammar.
# Comments will be ignored.

# ID = DQ_STRING ...
fred = "Double quoted string"

# ID = SQ_STRING ...
barney ='Single quoted string'

# ID = UQ_STRING ...
wilma = unquoted string
#-----------------------------------------------------------


And here's a test run with trace and diagnostics enabled:

$ grun test test -trace -gui -diagnostics test.txt
enter test, LT(1)=fred
enter stat, LT(1)=fred
consume [@0,93:96='fred',<3>,5:0] rule stat
consume [@1,98:98='=',<1>,5:5] rule stat
consume [@2,100:121='"Double quoted string"',<5>,5:7] rule stat
exit stat, LT(1)=barney
enter stat, LT(1)=barney
consume [@3,145:150='barney',<3>,8:0] rule stat
consume [@4,152:152='=',<1>,8:7] rule stat
consume [@5,153:174=''Single quoted string'',<6>,8:8] rule stat
exit stat, LT(1)=wilma = unquoted string

line 11:0 extraneous input 'wilma = unquoted string\n' expecting {<EOF>, ID}
consume [@7,223:222='<EOF>',<-1>,13:0] rule test
exit test, LT(1)=<EOF>


The questions:

Q1. Why, if UQ_STRING is the very last lexer rule, is the input "wilma =
unquoted string" not being tokenised as ID = UQ_STRING? I was under the
impression that the higher up the lexer rules list a rule was, the
higher its priority. So, in my perfect world, wilma should be an ID?

Q2. Probably related to rule priority: if I move the WS rule to the
bottom, or close to it, why do I get "extraneous input" errors when I
have blank lines in the test.txt file?

As mentioned, this is a minimal test case. I have the ANTLR4 book, and
I've tried all possible combinations of lexing plain text from as many
examples as I can find, none seem to work. It's driving me slightly
insane now.

As ever, I appreciate any help - but please be gentle, I do this for
fun, not for a living, and I'm really not very good at it. Yet.

Thanks.

--
Cheers,
Norm.

Jonathan Martin

Oct 29, 2015, 2:10:42 PM
to antlr-discussion
Hi Norm,

Q1.  The lexer chooses the longest match first.  Rule order priority only comes into play if the longest matching string can be matched by multiple rules.  UQ_STRING matches all of 'wilma = unquoted string' whereas ID only matches 'wilma' so UQ_STRING is the longest match in this case.

Q2.  UQ_STRING can match any non-quote characters followed by a newline, which means it can match strings such as '\r\n' or '  \t\n', i.e. you have two rules that can match whitespace strings of the same length: UQ_STRING and WS.  So now rule order matters.  If you put WS before UQ_STRING then any extraneous whitespace gets matched by WS and discarded.  When they are the other way round, the whitespace is matched by UQ_STRING and this token gets passed to the parser, but there isn't a grammar rule which matches UQ_STRING when it appears by itself, so it complains about extraneous input.
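
Both answers can be checked outside ANTLR with a small model. What follows is a rough Python sketch, not ANTLR itself: `next_token` is a hypothetical helper that tries every rule at one position, keeps the longest match, and lets the earlier rule win a tie. The regular expressions are hand translations of the rules in test.g4 and are assumptions of this sketch.

```python
import re

def next_token(rules, text, pos=0):
    """Return (rule_name, lexeme) for the longest match at pos.

    rules is an ordered list of (name, regex). The comparison is a strict
    "longer than", so on equal length the rule listed first is kept.
    """
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if m and m.end() > pos:
            if best is None or len(m.group()) > len(best[1]):
                best = (name, m.group())
    return best

# Hand-translated from test.g4 (assumed equivalents, not generated by ANTLR).
WS = ("WS", r"[ \t\r\n]+")
ID = ("ID", r"[A-Za-z][A-Za-z0-9_\-]*")
UQ = ("UQ_STRING", r"[^'\"]*?\n")

# Q1: UQ_STRING is listed last but matches the whole line, so it wins.
print(next_token([WS, ID, UQ], "wilma = unquoted string\n")[0])  # UQ_STRING

# Q2: on a blank line both WS and UQ_STRING match "\n"; order decides.
print(next_token([WS, ID, UQ], "\n")[0])  # WS
print(next_token([ID, UQ, WS], "\n")[0])  # UQ_STRING
```

In this model ID only ever sees 'wilma' (5 characters), while UQ_STRING sees the full line, which is why moving UQ_STRING to the bottom of the grammar makes no difference for Q1.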

Do you really want your UQ_STRING rule to match any non-quote character before the newline?  Should it match an '=' character, for example?  Can you be more specific about what it can match?

Jonathan

Jim Idle

Oct 29, 2015, 10:07:45 PM
to antlr-di...@googlegroups.com
Basically the language you are parsing is badly designed from a context free grammar point of view. If you are in charge of the syntax, then just throw the unquoted bit away ;)

However, in more practical terms, assuming you are using v4, you may be able to use lexical modes. Upon discovering an '=', look ahead to see whether the first non-space character is something other than " or '. If you find a non-quote, enter a new lexical mode. In that mode, have a token that eats everything up to a newline and then exits the mode. This will work so long as there are no complicated situations where = has some other meaning.
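
The mode idea can be modelled in a few lines of Python. This is only a sketch of the control flow, not an ANTLR lexer grammar: the rule table, the `tokenize` function and the "value" mode name are all hypothetical, and the regular expressions are assumed equivalents of the rules in test.g4. The default-mode rules start with disjoint characters, so taking the first rule that matches is enough here.

```python
import re

# Default-mode rules (assumed translations of the rules in test.g4).
DEFAULT_RULES = [
    ("WS",        re.compile(r"[ \t\r\n]+")),
    ("COMMENT",   re.compile(r"#[^\n]*\n?")),
    ("ID",        re.compile(r"[A-Za-z][A-Za-z0-9_\-]*")),
    ("EQ",        re.compile(r"=")),
    ("DQ_STRING", re.compile(r'"[^"]*"')),
    ("SQ_STRING", re.compile(r"'[^']*'")),
]
# Hypothetical "value" mode: a single token that runs to the end of the line.
VALUE_RULE = re.compile(r"[^\n]*")

def tokenize(text):
    tokens, pos, mode = [], 0, "default"
    while pos < len(text):
        if mode == "value":
            m = VALUE_RULE.match(text, pos)
            tokens.append(("UQ_STRING", m.group().strip()))
            pos, mode = m.end(), "default"  # leave the mode after one token
            continue
        for name, rx in DEFAULT_RULES:
            m = rx.match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        if name in ("WS", "COMMENT"):
            continue  # skipped, as with '-> skip'
        tokens.append((name, m.group()))
        if name == "EQ":
            # Look ahead past blanks; switch mode unless a quote follows.
            rest = text[pos:].lstrip(" \t")
            if rest and rest[0] not in "\"'\r\n":
                mode = "value"
    return tokens

print(tokenize("wilma = unquoted string\n"))
```

As the caveat above says, this only holds while every '=' introduces a value that runs to the end of the line.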

Another option may be to leave the whole unquoted thing out of the lexer. Make sure you have a rule:

ANY: . ;

as the last rule of your lexer, and make sure that you do not skip/hide NL.

Then in your parser, try:

assign: ID EQUAL ( DQUOTE | SQUOTE | ~NL*) NL;

However this will quickly become convoluted and awkward if people can write things like:

D = some
     text
# This line ends the assignment.

Or if elsewhere in the grammar, the NL is not significant.


If this cannot be done from a lexing point of view, then you may have to hand craft a parser as it is clearly context dependent and is lexing as it parses.

If this is something like a config file, then I suggest that ANTLR might not be the best thing to use to read it. What you have will not work.

Jim


Norman Dunbar

Oct 30, 2015, 8:06:55 AM
to antlr-di...@googlegroups.com
Morning Jonathan,

thanks for taking the time to read through my problem. I appreciate it.

On 29/10/15 18:10, Jonathan Martin wrote:
> Hi Norm,
>
> Q1. The lexer chooses the longest match first. Rule order priority
> only comes into play if the longest matching string can be matched by
> multiple rules.
This explanation makes perfect sense. Thanks. I understand this bit a
lot better now.


> Q2. UQ_STRING can match any non-quote characters followed by a newline
> which means it can match strings such as '\r\n' or ' \t\n', i.e. you
> have two rules that can match whitespace strings of the same length:
> UQ_STRING and WS. So now rule order matters. If you put WS before
> UQ_STRING then any extraneous whitespace gets matched by WS and
> discarded. When they are the other way round then the whitespace is
> matched by UQ_STRING and this token gets passed to the parser but there
> isn't a grammar rule which matches UQ_STRING when it appears by itself
> so it complains about extraneous input.
Ok, that's a lot clearer now as well. I'm thinking that it's going to be
a bit of a nightmare to sort out the big grammar!

> Do you really want your UQ_STRING rule to match any non-quote character
> before the newline? Should it match an '=' character, for example? Can
> you be more specific about what it can match?
It should not contain an '=', so that's been changed and makes a big
difference to the test grammar. That all lexes and parses perfectly now
- thanks. Well, on the test grammar it does, the big grammar is still
broken - but at least it's recognising the unquoted string, it's just
recognising lots of other unquoted strings where they most certainly are
not wanted.
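
One plausible way the remaining problem arises can be reproduced with a small longest-match model in Python. This is a sketch under stated assumptions, not the real grammar: the exact change made to UQ_STRING is not shown in the thread, so the pattern below simply adds '=' to the excluded characters; the `lex` function, the EQ, LPAREN and RPAREN rule names and the sample lines are hypothetical, and the quoted-string rules are left out for brevity.

```python
import re

# Rules in grammar order; ties go to the earlier rule.
RULES = [
    ("WS",        r"[ \t\r\n]+"),
    ("ID",        r"[A-Za-z][A-Za-z0-9_\-]*"),
    ("EQ",        r"="),
    ("LPAREN",    r"\("),
    ("RPAREN",    r"\)"),
    ("UQ_STRING", r"[^'\"=]*?\n"),  # assumed: '=' is now excluded as well
]

def lex(text):
    pos, out = 0, []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer matches replace the current best.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, m = best
        pos = m.end()
        if name != "WS":  # WS is skipped
            out.append((name, m.group()))
    return out

# The test line now splits as hoped: ID, '=', UQ_STRING.
print(lex("wilma = unquoted string\n"))

# A parenthesised line: the closing ')' is swallowed by UQ_STRING.
print(lex("(HOST = host_name)\n"))
```

In the second call everything after the '=' up to the newline, including the ')', still fits UQ_STRING and is longer than any other candidate, so no RPAREN token is ever produced.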

Not to worry, I understand the problem better - thanks again.


Cheers,
Norm.


--
Norman Dunbar
Dunbar IT Consultants Ltd

Registered address:
27a Lidget Hill
Pudsey
West Yorkshire
United Kingdom
LS28 7LG

Company Number: 05132767

Norman Dunbar

Oct 30, 2015, 8:07:24 AM
to antlr-di...@googlegroups.com

Hi Jim,

thanks for taking an interest.

The big grammar is the tnsnames one found in the ANTLR4 Grammars
repository on GitHub.


On 30/10/15 02:07, Jim Idle wrote:
> Basically the language you are parsing is badly designed from a context
> free grammar point of view. If you are in charge of the syntax, then
> just throw the unquoted bit away
I'm afraid I have no control over the language in use; that's all
Oracle's domain. It's currently working in a parser I've created but,
unfortunately, I've just found out that one particular part of the
language allows single, double or unquoted strings.

I'm stuck with it, or so it seems. I could just ignore the unquoted
string part, but I've come across a couple of files that actually use
it, so I need to get on and fix my parser.

> However, in more practical terms, assuming you are using v4, you may be
> able to use lexical modes. Upon discovery of an '=' then look ahead to
> see if the first non-space is other than " or '. If you find a non-quote
> then enter a new lexical mode. In that mode have a token that eats
> anything up until a newline, then exits the mode. This will work so long
> as there are not complicated situations where = has some other meaning.
I was thinking about the lexer modes, but I thought that might be a
little overkill for just this "simple" change. It now turns out that
simple isn't quite so simple after all.

I am using V4 and as mentioned, I have the book (paper and Kindle
copies, just in case) and I'm checking out all about modes even as I
type. It seems I'll have to split my grammar into two to be able to use
them. Not a major problem I admit.

The actual code I'm trying to lex, and eventually parse, is:

IFILE = some/path/to/a/file
IFILE = 'some/path/to/a/file'
IFILE = "some/path/to/a/file"

It's the Oracle equivalent of C/C++'s #include, except it either uses
quotes as the delimiter, or no quotes and everything up to the EOL
(without leading, trailing or embedded whitespace).

I think lexer modes should be the easiest answer. (Or I hope so anyway!)
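
For comparison outside ANTLR, the three IFILE forms listed above can be described by a single pattern. This is a hedged Python sketch, not part of the grammar: `IFILE_LINE` and `ifile_path` are hypothetical names, and it assumes the keyword is written exactly as IFILE, one assignment per line, with an unquoted path containing no whitespace or quote characters.

```python
import re

# One alternative per form: double quoted, single quoted, unquoted.
IFILE_LINE = re.compile(
    r"\s*IFILE\s*=\s*"
    r"(?:\"([^\"]*)\"|'([^']*)'|([^\s\"']+))"
    r"\s*"
)

def ifile_path(line):
    """Return the path from an IFILE line, or None if the line does not fit."""
    m = IFILE_LINE.fullmatch(line)
    if m is None:
        return None
    # Exactly one of the three groups took part in the match.
    return next(g for g in m.groups() if g is not None)

for line in ("IFILE = some/path/to/a/file",
             "IFILE = 'some/path/to/a/file'",
             'IFILE = "some/path/to/a/file"'):
    print(ifile_path(line))
```

An unquoted path with embedded whitespace, such as `IFILE = some path`, is rejected by this pattern rather than truncated.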

>
> Another option may be to leave the whole unquoted thing out of the
> lexer. Make sure you have a rule:
>
> ANY: . ;
>
> as the last rule of your lexer, and make sure that you do not
> skip/hide NL.
>
> Then in your parser, try:
>
> assign: ID EQUAL ( DQUOTE | SQUOTE | ~NL*) NL;
>
> However this will quickly become convoluted and awkward if people can
> write things like:
>
> D = some
> text
> # This line ends the assignment.
>
> Or if elsewhere in the grammar, the NL is not significant.

I'm a little unsure of what you have explained here, sorry.

Whitespace is completely ignored everywhere in the "normal" manner of
lexing. It's only when I come across one of these unquoted strings that
the trailing newline is relevant to terminate the file path. Many of the
parameters in this file can be on multi-lines. Most are '(' and ')'
delimited, as for example:

...
(ADDRESS =
(PROTOCOL = TCP)
(HOST = host_name)
(PORT = 1234)
)
...

Quotes are not used often in the source files, just for file names to be
included (IFILE'd) or in very rare cases where a double quoted string is
required in only one or two other places.


> If this cannot be done from a lexing point of view, then you may have to
> hand craft a parser as it is clearly context dependent and is lexing as
> it parses.
I did one of those many years ago. It wasn't fun!


> If this is something like a config file, then I suggest that ANTLR might
> not be the best thing to use to read it. What you have will not work.
It is a config file, yes, an Oracle tnsnames.ora file to be specific. Up
until now, ANTLR4 has been a huge help in parsing it and - in my case -
highlighting syntax errors, duplications etc in the files, which need to
be sorted. Some of these config files can be thousands of lines long.

Thanks again.

Jonathan Martin

Oct 30, 2015, 11:51:58 AM
to antlr-discussion
Hi Norm,

You said you're trying to parse tnsnames.ora files.  What specification are you working from?  The specification here (https://docs.oracle.com/cd/A57673_01/DOC/net/doc/NWUS233/apb.htm), for example, doesn't specify that values are newline terminated.  Instead, there is a well defined set of characters that can be used to form a non-quoted value:

Network Character Set

The network character set consists of the following characters. Values given for keywords must be made up only of characters in this set. Connect descriptors must be made up of single-byte characters.

		A-Z, a-z
		0-9
		( ) < > / \ 
		, . : ; ' " = - _ 
		$ + * # & ! % ? @ 

Within this character set, the following symbols are reserved:

		(  ) = \ " ' # 

Reserved symbols should be used only as delimiters, not as part of a keyword or a value unless the keyword or value is quoted. Either single or double quotes can be used to enclose a value containing reserved symbols. To include a quote within a value that is surrounded by quotes, use different quote types. The backslash (\) is used as an escape character.

A specific example of the use of reserved symbols is in a numeric DECnet object within a DECnet address. As defined by DECnet, an OBJECT can be a name such as ABC or a value such as #123. These would be entered in the form:

(OBJECT=ABC)

or

(OBJECT=\#123) 

The numeric DECnet object requires a symbol that is reserved by TNS. Because # is a reserved symbol, the character must be preceded by a backslash. See the Oracle Protocol Adapter information for your platform for further details on DECnet.

The following characters can be used within the structure of a connect descriptor, but cannot be part of a keyword or value:

		<space> <tab> <CR> <newline>


Doesn't this describe the valid non-quoted values that you want to match?
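
Read literally, the quoted character set suggests a pattern for unquoted values: one or more unreserved characters, or a reserved symbol preceded by a backslash. The Python sketch below is one interpretation of that text, not an official definition; `UNQUOTED_VALUE` and `is_unquoted_value` are hypothetical names, and treating the backslash as escaping exactly one reserved symbol is an assumption.

```python
import re

# Network character set minus the reserved symbols ( ) = \ " ' #
UNRESERVED = r"A-Za-z0-9<>/,.:;\-_$+*&!%?@"
# Reserved symbols, allowed in an unquoted value only after a backslash.
RESERVED = r"()=\\\"'#"

UNQUOTED_VALUE = re.compile(rf"(?:[{UNRESERVED}]|\\[{RESERVED}])+")

def is_unquoted_value(text):
    """True if text consists only of unreserved or backslash-escaped characters."""
    return UNQUOTED_VALUE.fullmatch(text) is not None

print(is_unquoted_value("ABC"))        # True
print(is_unquoted_value(r"\#123"))     # True, '#' is escaped
print(is_unquoted_value("#123"))       # False, '#' is reserved
print(is_unquoted_value("host name"))  # False, space is not in the set
```

Because whitespace and the closing parenthesis are outside the unreserved set, a value defined this way ends on its own without needing a newline terminator.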

Jonathan

Norman Dunbar

Oct 30, 2015, 1:08:14 PM
to antlr-di...@googlegroups.com

Hi Jonathan,

I'm using the 11G docs found at
http://docs.oracle.com/cd/E11882_01/network.112/e10835/tnsnames.htm as
my main documentation. However, the IFILE = "/some/path/etc" (with or
without quotes) is not actually listed in there, but is listed in the
Reference Guide for the init.ora parameters.

The string after the '=' is permitted to be any character that is legal
on the underlying file system where the tnsnames.ora file lives.

I've got a very nearly working version now using lexer modes, just a few
minor bugs to kill, hopefully soon, and it will be done.

Without modes, the main grammar gets all confused, as do I!