Hi Jim,
Thanks for taking an interest.
The big grammar is the tnsnames one found in the ANTLR4 Grammars
repository on GitHub.
On 30/10/15 02:07, Jim Idle wrote:
> Basically the language you are parsing is badly designed from a context
> free grammar point of view. If you are in charge of the syntax, then
> just throw the unquoted bit away
I'm afraid I have no control over the language in use; that's all
Oracle's domain. It's currently working in a parser I've created but,
unfortunately, I've just found out that one particular part of the
language allows single-quoted, double-quoted or unquoted strings.
I'm stuck with it, or so it seems. I could just ignore the unquoted
string part, but I've come across a couple of files that actually use
it, so I need to get on and fix my parser.
> However, in more practical terms, assuming you are using v4, you may be
> able to use lexical modes. Upon discovery of an '=' then look ahead to
> see if the first non-space is other than " or '. If you find a non-quote
> then enter a new lexical mode. In that mode have a token that eats
> anything up until a newline, then exits the mode. This will work so long
> as there are not complicated situations where = has some other meaning.
I was thinking about the lexer modes, but I thought that might be a
little overkill for just this "simple" change. It now turns out that
simple isn't quite so simple after all.
I am using V4 and, as mentioned, I have the book (paper and Kindle
copies, just in case) and I'm reading up on modes even as I type. It
seems I'll have to split my combined grammar into separate lexer and
parser grammars to be able to use them. Not a major problem, I admit.
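As far as I understand it, modes are only allowed in a standalone lexer
grammar, so the split would look roughly like the sketch below. The
grammar names (TnsLexer, TnsParser) and the rules shown are placeholders
of mine for illustration, not what is actually in the repository.

```antlr
// File: TnsLexer.g4 (hypothetical name)
lexer grammar TnsLexer;

EQUAL : '=' ;
ID    : [A-Za-z] [A-Za-z0-9_.]* ;
WS    : [ \t\r\n]+ -> skip ;
// ... any "mode" sections would go at the end of this file ...

// File: TnsParser.g4 (hypothetical name)
parser grammar TnsParser;
options { tokenVocab = TnsLexer; }   // take token types from the lexer grammar

assignment : ID EQUAL ID ;           // stand-in for the real parser rules
```

The parser grammar can then only refer to tokens by name; literals such
as '=' have to be defined in the lexer grammar first.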
The actual code I'm trying to lex, and eventually parse, is:
IFILE = some/path/to/a/file
IFILE = 'some/path/to/a/file'
IFILE = "some/path/to/a/file"
It's the Oracle equivalent of C/C++'s #include, except it either uses
quotes as the delimiter, or no quotes and everything up to the EOL
(without leading, trailing or embedded whitespace).
I think lexer modes should be the easiest answer. (Or I hope so anyway!)
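Here is a rough sketch of how I imagine the mode would handle the three
IFILE forms above. It differs slightly from your suggestion: rather than
looking ahead from the '=', it enters the mode on the IFILE keyword
itself, which I am assuming is safe because the keyword is always
followed by '=' and a value. The grammar name, mode name and token names
are all made up for this example, and I am also assuming the keyword is
case-insensitive.

```antlr
lexer grammar TnsLexer;   // placeholder name

// Default mode: whitespace, including newlines, is skipped as usual.
IFILE : [Ii] [Ff] [Ii] [Ll] [Ee] -> pushMode(IFILE_VALUE) ;
EQUAL : '=' ;
WS    : [ \t\r\n]+ -> skip ;

// Entered right after the IFILE keyword; left as soon as the value is lexed.
mode IFILE_VALUE;

IV_EQUAL  : '=' -> type(EQUAL) ;                  // reuse the normal EQUAL token type
IV_WS     : [ \t]+ -> skip ;                      // spaces and tabs only, not newlines
DQ_STRING : '"' ~["\r\n]* '"'   -> popMode ;      // "some/path/to/a/file"
SQ_STRING : '\'' ~['\r\n]* '\'' -> popMode ;      // 'some/path/to/a/file'
UNQUOTED  : ~[ \t\r\n"'=] ~[ \t\r\n]* -> popMode ; // some/path/to/a/file
```

The parser rule would then be something like
ifile : IFILE EQUAL (DQ_STRING | SQ_STRING | UNQUOTED) ;
Note that this sketch does not allow a newline between the '=' and the
value; if Oracle permits that, IV_WS would need to include \r and \n.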
>
> Another option may be to leave the whole unquoted thing out of the
> lexer. Make sure you have a rule:
>
> ANY: . ;
>
> As the last rule of your lexer.
>
> And make sure that you do not skip/hide NL.
>
> Then in your parser, try:
>
> assign: ID EQUAL ( DQUOTE | SQUOTE | ~NL*) NL;
>
> However this will quickly become convoluted and awkward if people can
> write things like:
>
> D = some
> text
> # This line ends the assignment.
>
> Or if elsewhere in the grammar, the NL is not significant.
I'm a little unsure of what you have explained here, sorry.
Whitespace is completely ignored everywhere in the "normal" manner of
lexing. It's only when I come across one of these unquoted strings that
the trailing newline is relevant to terminate the file path. Many of the
parameters in this file can span multiple lines. Most are delimited by
'(' and ')', for example:
...
(ADDRESS =
  (PROTOCOL = TCP)
  (HOST = host_name)
  (PORT = 1234)
)
...
Quotes are rarely used in the source files: just for the names of files
to be included (IFILE'd) and, in one or two other places, in the very
rare cases where a double-quoted string is required.
> If this cannot be done from a lexing point of view, then you may have to
> hand craft a parser as it is clearly context dependent and is lexing as
> it parses.
I did one of those many years ago. It wasn't fun!
> If this is something like a config file, then i suggest that ANTLR might
> not be the best thing to use to read it. What you have will not work.
It is a config file, yes, an Oracle tnsnames.ora file to be specific. Up
until now, ANTLR4 has been a huge help in parsing it and - in my case -
highlighting syntax errors, duplications, etc. in the files, which need to
be sorted. Some of these config files can be thousands of lines long.
Thanks again.