Question on Grammar Specification

33 views
Skip to first unread message

LogicProgrammer

unread,
Nov 18, 2009, 7:13:55 PM11/18/09
to lepl
Hi All,

I'm attempting to define a grammar for a simple logic programming
language. The initial portion of the EBNF is as follows:
(comments follow a %)

OBJECT :== a-z[a-zA-Z0-9]* %alpha-numeric string which begins
with a lower case letter
VARIABLE :== A-Z[a-zA-Z0-9]* %alpha-numeric string which begins
with an uppercase letter
FUNC :== OBJECT
TERM :== OBJECT | VARIABLE | FUNC ( TERM [, TERM]* ) % term is
either an object, variable, or a function symbol of arity >= 1 -
parenthesis are literal tokens

PRED :== OBJECT
PREDSYM :== PRED | PRED ( TERM [, TERM]* )

To my understanding I should be able to use LEPL along the following
lines:

OBJECT = Token('[a-z][a-zA-Z0-9]*')
VARIABLE = Token('[A-Z][a-zA-Z0-9]+')
FUNC = OBJECT
TERM = Delayed()
TERM += OBJECT | VARIABLE | FUNC & '(' & TERM & ZeroOrMore(',' & TERM)
& ')'

etc.

For some reason though (probably a stupid mistake on my part) this
fails to parse the following:

parser = TERM.parse_string
TERM.parse("p(X)")

Am I doing something completely wrong?

Thank you kindly,
Gregory Gelfond

LogicProgrammer

unread,
Nov 18, 2009, 7:15:02 PM11/18/09
to lepl
Also if I want the the results of a parse to be represented as a
string object, am I correct that I would append '> str' to the end of
the corresponding grammar rule?

andrew cooke

unread,
Nov 18, 2009, 7:31:05 PM11/18/09
to le...@googlegroups.com
i'm not sure exactly what problem you are hitting - you should have
got an error saying something about mixing tokens and non-tokens
because at the very least you need to make "(" and ")" tokens.

andrew


2009/11/18 LogicProgrammer <logicpr...@gmail.com>:
> --
>
> You received this message because you are subscribed to the Google Groups "lepl" group.
> To post to this group, send email to le...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/lepl?hl=.
>
>
>

LogicProgrammer

unread,
Nov 18, 2009, 7:32:19 PM11/18/09
to lepl
Also, I understand how to parse a single string, but what I want to
parse an entire file?

Suppose that I had a rule which says something like this:

STATEMENT ::= TERM [, TERM]* '.'

i.e - a statement is a comma-delimited sequence of 1 or more terms
followed by period. And a program is defined as one or more
statements:

PROGRAM ::= STATEMENT+

On a successful parse I would like to get a list of all of the
"statement objects".

How would I attempt something like this?

Thanks again.

LogicProgrammer

unread,
Nov 18, 2009, 7:42:03 PM11/18/09
to lepl
Ok here is what I'm trying to do (in full):

>>> from lepl import *
>>> LPAREN = Token('(')
>>> RPAREN = Token(')')
>>> OBJECT = Token('[a-z][a-zA-Z0-9]*')
>>> VAR = Token('[A-Z][a-zA-Z0-9]*')
>>> FUNC = OBJECT
>>> TERM = Delayed()
>>> TERM += OBJECT | VAR | FUNC & LPAREN & TERM & ZeroOrMore(Token(',') & TERM) & RPAREN
>>> parser = TERM.parse_string
>>> TERM.parse("p(X)")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.6/site-packages/lepl/matchers.py", line 189,
in parse
return self.null_parser(config, **kargs)(stream)
File "/Library/Python/2.6/site-packages/lepl/matchers.py", line 152,
in null_parser
return make_parser(self, config.stream.null, config, kargs)
File "/Library/Python/2.6/site-packages/lepl/parser.py", line 247,
in make_parser
matcher = make_matcher(matcher, stream, config, kargs)
File "/Library/Python/2.6/site-packages/lepl/parser.py", line 229,
in make_matcher
matcher = rewriter(matcher)
File "/Library/Python/2.6/site-packages/lepl/lexer/rewriters.py",
line 126, in rewriter
return Lexer(matcher, tokens, alphabet, discard, source=source)
File "/Library/Python/2.6/site-packages/lepl/lexer/matchers.py",
line 324, in __init__
for t in tokens]).dfa()
File "/Library/Python/2.6/site-packages/lepl/regexp/core.py", line
430, in multiple
alphabet) for (label, regexp) in regexps],
File "/Library/Python/2.6/site-packages/lepl/regexp/core.py", line
408, in _coerce
.format(regexp, alphabet))
lepl.regexp.core.RegexpError: Cannot parse regexp '(' using
<lepl.regexp.unicode.UnicodeAlphabet object at 0x100673f10>

Thanks for the quick reply Andrew by the way.

- Greg

On Nov 18, 5:31 pm, andrew cooke <and...@acooke.org> wrote:
> i'm not sure exactly what problem you are hitting - you should have
> got an error saying something about mixing tokens and non-tokens
> because at the very least you need to make "(" and ")" tokens.
>
> andrew
>
> 2009/11/18 LogicProgrammer <logicprogram...@gmail.com>:

andrey churin

unread,
Nov 19, 2009, 2:52:53 AM11/19/09
to lepl
Hi,

from lepl import *
LPAREN = Token('\(')
RPAREN = Token('\)')
OBJECT = Token('[a-z][a-zA-Z0-9]*')
VAR = Token('[A-Z][a-zA-Z0-9]*')
FUNC = OBJECT
TERM = Delayed()
TERM += FUNC & LPAREN & TERM & ZeroOrMore(Token(',') & TERM) & RPAREN
| OBJECT | VAR
parser = TERM.parse_string
print TERM.parse("p(X)")

andrey

andrew cooke

unread,
Nov 19, 2009, 4:48:56 AM11/19/09
to le...@googlegroups.com
2009/11/18 LogicProgrammer <logicpr...@gmail.com>:
> Also, I understand how to parse a single string, but what I want to
> parse an entire file?

pass the file object to parse_file() or pass the file name to parse_path().

> Suppose that I had a rule which says something like this:
>
> STATEMENT ::= TERM [, TERM]* '.'
>
> i.e - a statement is a comma-delimited sequence of 1 or more terms
> followed by period. And a program is defined as one or more
> statements:
>
> PROGRAM ::= STATEMENT+
>
> On a successful parse I would like to get a list of all of the
> "statement objects".
>
> How would I attempt something like this?

statement = term[0:,","] & "."
program = statement[1:]

andrew cooke

unread,
Nov 19, 2009, 4:50:07 AM11/19/09
to le...@googlegroups.com
tokens are defined by regular expressions.

Token("(") means is a token that matces the regexp "("

"(" is not a valid regexp.

you want Token(r"\(")

2009/11/18 LogicProgrammer <logicpr...@gmail.com>:

LogicProgrammer

unread,
Nov 19, 2009, 4:17:44 PM11/19/09
to lepl
Ok so I've made the suggested corrections as follows and the grammar
looks like this:

from lepl import *

# language grammar specification

COMMA = Token(',')
LPAREN = Token(r"\(")
RPAREN = Token(r"\)")
OBJECT = Token('[a-z][a-zA-Z0-9]*')
VARIABLE = Token('[A-Z][a-zA-Z0-9]*')
FUNCTION = Token('[a-z][a-zA-Z0-9]*')

TERM = Delayed()
TERM += OBJECT | VARIABLE | FUNCTION & LPAREN & TERM & ZeroOrMore
(COMMA & TERM) & RPAREN

However, the parse does not work in the way that I expect.

Here is what I'm seeing:

>>> from ActionLanguageAL import *
>>> parser = TERM.parse_string
>>> TERM.parse("hello")
['hello']
>>> TERM.parse("hello(X)")
['hello']

With regards to TERM.parse("hello(X)"), I was expecting the return
value to be ['hello(X)']. I may be missing something but I can't seem
to tell from the documentation what is really going on.

Thank you for the help by the way.

- Greg

On Nov 19, 2:50 am, andrew cooke <and...@acooke.org> wrote:
> tokens are defined by regular expressions.
>
> Token("(") means is a token that matces the regexp "("
>
> "(" is not a valid regexp.
>
> you want Token(r"\(")
>
> 2009/11/18 LogicProgrammer <logicprogram...@gmail.com>:

LogicProgrammer

unread,
Nov 19, 2009, 4:46:44 PM11/19/09
to lepl
I apologize for all of the novice questions.

Suppose I now extend the grammar as follows:

from lepl import *

# language grammar specification

IF = ~Token('if')
COMMA = Token(',')
PERIOD = Token('.')
LPAREN = Token(r"\(")
RPAREN = Token(r"\)")
OBJECT = Token('[a-z][a-zA-Z0-9]*')
VARIABLE = Token('[A-Z][a-zA-Z0-9]*')
FUNCTION = Token('[a-z][a-zA-Z0-9]*')

TERM = Delayed()
TERM += FUNCTION & LPAREN & TERM & ZeroOrMore(COMMA & TERM) & RPAREN |
\
OBJECT | \
VARIABLE > ''.join


SCL = TERM & IF & TERM & PERIOD

The last rule to my knowledge states that any string of the from "TERM
if TERM." is an SCL. However:

>>> from ActionLanguageAL import *
>>> parser = SCL.parse_string
>>> print SCL.parse("p if a.")
None

Both "p" and "a" satisfy TERM

>>> TERM.parse("p")
['p']
>>> TERM.parse("a")
['a']

but for some reason SCL does not parse "p if a." Can some please shed
some light as to why?

andrew cooke

unread,
Nov 19, 2009, 4:59:09 PM11/19/09
to le...@googlegroups.com
hi,

so three separate but related things here:

1 - lepl will give all possible parses for your expression if you use
"match" rather than "parse". if you did that you would see both
"hello" and "hello(...)" (i hope)

2 - if you want to force lepl to match the entire string, end the
parser with Eos()

3 - matchers are matched in order, so the reason you see "hello"
before "hello(...)" is because VARIABLE comes before FUNCTION in TERM.

sorry the docs aren't better (it's actually very hard to write good
docs, at least for me), but i will probaby start collecting things
like this into a faq.

andrew



2009/11/19 LogicProgrammer <logicpr...@gmail.com>:

LogicProgrammer

unread,
Nov 19, 2009, 5:09:48 PM11/19/09
to lepl
The documentation is actually fairly good (it's better than most
online documentation that I've seen :-)).

I like the library quite a bit, I just need to iron out some stuff
(I'm trying to write a compiler using the library as part of a
research project).

Thank you for your patience and help.

- Greg

On Nov 19, 2:59 pm, andrew cooke <and...@acooke.org> wrote:
> hi,
>
> so three separate but related things here:
>
> 1 - lepl will give all possible parses for your expression if you use
> "match" rather than "parse".  if you did that you would see both
> "hello" and "hello(...)" (i hope)
>
> 2 - if you want to force lepl to match the entire string, end the
> parser with Eos()
>
> 3 - matchers are matched in order, so the reason you see "hello"
> before "hello(...)" is because VARIABLE comes before FUNCTION in TERM.
>
> sorry the docs aren't better (it's actually very hard to write good
> docs, at least for me), but i will probaby start collecting things
> like this into a faq.
>
> andrew
>
> 2009/11/19 LogicProgrammer <logicprogram...@gmail.com>:

LogicProgrammer

unread,
Nov 19, 2009, 5:15:15 PM11/19/09
to lepl
To give some broader context into what I'm trying to do, I essentially
want something like this:

from lepl import *

# language grammar specification - basic objects

IMPOSS = Token('impossible')
CAUSES = Token('causes')
IF = Token('if')
COMMA = Token(',')
PERIOD = Token('.')
LPAREN = Token(r"\(")
RPAREN = Token(r"\)")
OBJECT = Token('[a-z][a-zA-Z0-9]*')
VARIABLE = Token('[A-Z][a-zA-Z0-9]*')
FUNCTION = Token('[a-z][a-zA-Z0-9]*')

TERM = Delayed()
TERM += FUNCTION & LPAREN & TERM & ZeroOrMore(COMMA & TERM) & RPAREN |
\
OBJECT | \
VARIABLE > ''.join

# language grammar specification - causal laws

SCL = TERM & IF & TERM & PERIOD
DCL = TERM & CAUSES & TERM & IF & TERM & PERIOD
EXC = IMPOSS & TERM & IF & TERM & PERIOD

# language grammar specification - action description

STATEMENT = SCL | DCL | EXC

PROGRAM = OneOrMore(STATEMENT)

If I parse an input text (a file) using PROGRAM as my start symbol, I
would like to have a list of objects returned, where each object is a
particular representation of some statement type (either SCL, DCL, or
EXC). I should then be able to iterate through the list and invoke a
common method to get a translation of the entire program.

So for SCL, I would have a class StaticCausalLaw, and the grammar rule
+ semantic action would be (to my knowledge):

SCL = TERM & IF & TERM & PERIOD > Args(StaticCausalLaw) # constructor
for StaticCausalLaw takes 2 args, corresponding to the 2 TERMS

similarly for the other statement types.

Thank you kindly,
Gregory Gelfond

andrew cooke

unread,
Nov 19, 2009, 6:07:00 PM11/19/09
to le...@googlegroups.com
2 things stand out to me:

1 - Token('.'). do you really mean that? remember that tokens match
regular expressions....

2 - this may be a matter of style. i think it will work your way too
(LEPL can handle more than one token matching) but Token('impossible')
and Token('[a-z][a-zA-Z0-9]*') are both going to match the same thing.
instead i would do:

WORD = Token('[a-z][a-zA-Z0-9]*')
IMPOSS = WORD('impossible')
OBJECT = WORD

so only one kind of token will be generated for all of those, and
OBJECT will match as it did before. IMPOSS will match the *same*
token, but will then also test the contents for a literal match.

in other words, personally, i define tokens at a lower level than the
grammars - they're just the basic "different kinds of chunks of text"
you can match. then *inside* the token i add extra conditions. so i
normally have just three kinds of tokens: words, numbers and symbols.

see http://www.acooke.org/lepl/lexer.html#use half way down where i
talk abot specialisation.

andrew


2009/11/19 LogicProgrammer <logicpr...@gmail.com>:

LogicProgrammer

unread,
Nov 19, 2009, 6:19:43 PM11/19/09
to lepl
Hm,

The language isn't quite like that. I have three reserved words:
"impossible", "if", and "causes".

Terms make up the basic building block of everything else. A term is
either an object constant, variable or a function constant of arity >
0 (which are not reserved words).

For the moment causal law is a statement of one of the following
forms:

term causes term if term.
impossible term if term.
term if term.

At least that's the type of grammar I was going for.

I attempted changing the basic "word" type elements from Token to
Word, giving me the following grammar:

from lepl import *

# language grammar specification - basic objects

IMPOSS = Word('impossible')
CAUSES = Word('causes')
IF = Word('if')
COMMA = Token(',')
PERIOD = Token('.')
LPAREN = Token(r"\(")
RPAREN = Token(r"\)")
OBJECT = Token('[a-z][a-zA-Z0-9]*')
VARIABLE = Token('[A-Z][a-zA-Z0-9]*')
FUNCTION = Token('[a-z][a-zA-Z0-9]*')

TERM = Delayed()
TERM += FUNCTION & LPAREN & TERM & ZeroOrMore(COMMA & TERM) & RPAREN |
\
OBJECT | \
VARIABLE > ''.join

# language grammar specification - causal laws

SCL = TERM & IF & TERM & PERIOD
DCL = TERM & CAUSES & TERM & IF & TERM & PERIOD
EXC = IMPOSS & TERM & IF & TERM & PERIOD

# language grammar specification - action description

STATEMENT = SCL | DCL | EXC

PROGRAM = OneOrMore(STATEMENT)

But I get the following errors:

-bash-3.2 [src]$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from ActionLanguageAL import *
>>> parser = SCL.parse_string
>>> SCL.parse("a if p.")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.6/site-packages/lepl/matchers.py", line 189,
in parse
return self.null_parser(config, **kargs)(stream)
File "/Library/Python/2.6/site-packages/lepl/matchers.py", line 152,
in null_parser
return make_parser(self, config.stream.null, config, kargs)
File "/Library/Python/2.6/site-packages/lepl/parser.py", line 247,
in make_parser
matcher = make_matcher(matcher, stream, config, kargs)
File "/Library/Python/2.6/site-packages/lepl/parser.py", line 229,
in make_matcher
matcher = rewriter(matcher)
File "/Library/Python/2.6/site-packages/lepl/lexer/rewriters.py",
line 124, in rewriter
tokens = find_tokens(matcher)
File "/Library/Python/2.6/site-packages/lepl/lexer/rewriters.py",
line 62, in find_tokens
for n in non_tokens)))
lepl.lexer.support.LexerError: The grammar contains a mix of Tokens
and non-Token matchers at the top level. If Tokens are used then non-
token matchers that consume input must only appear "inside" Tokens.
The non-Token matchers include: Any.



On Nov 19, 4:07 pm, andrew cooke <and...@acooke.org> wrote:
> 2 things stand out to me:
>
> 1 - Token('.').  do you really mean that?  remember that tokens match
> regular expressions....
>
> 2 - this may be a matter of style. i think it will work your way too
> (LEPL can handle more than one token matching) but Token('impossible')
> and Token('[a-z][a-zA-Z0-9]*') are both going to match the same thing.
>  instead i would do:
>
> WORD = Token('[a-z][a-zA-Z0-9]*')
> IMPOSS = WORD('impossible')
> OBJECT = WORD
>
> so only one kind of token will be generated for all of those, and
> OBJECT will match as it did before.  IMPOSS will match the *same*
> token, but will then also test the contents for a literal match.
>
> in other words, personally, i define tokens at a lower level than the
> grammars - they're just the basic "different kinds of chunks of text"
> you can match.  then *inside* the token i add extra conditions.  so i
> normally have just three kinds of tokens: words, numbers and symbols.
>
> seehttp://www.acooke.org/lepl/lexer.html#usehalf way down where i
> talk abot specialisation.
>
> andrew
>
> 2009/11/19 LogicProgrammer <logicprogram...@gmail.com>:
> >> > >> > You received this message...
>
> read more »

andrew cooke

unread,
Nov 19, 2009, 7:48:57 PM11/19/09
to le...@googlegroups.com
ok, so slow down and think a bit. this isn't what i described.
please read my email again. also, you are still using Token('.').

andrew

2009/11/19 LogicProgrammer <logicpr...@gmail.com>:

andrey churin

unread,
Nov 19, 2009, 8:01:53 PM11/19/09
to lepl
Do not mix tokens and matchers "at the top level":

WORD = Token('[a-z][a-zA-Z0-9]*') # see previous Andrew post
IMPOSS = WORD('impossible')
CAUSES = WORD('causes')
IF = WORD('if')
COMMA = Token(',')
PERIOD = Token(r'\.') # . is one of the special characters, see
previous Andrew post

ps: please, read carefully the previous posts.
> > seehttp://www.acooke.org/lepl/lexer.html#usehalfway down where i
> ...
>
> read more »

LogicProgrammer

unread,
Nov 19, 2009, 8:41:30 PM11/19/09
to lepl
Ok I understand now Andrew.

Thank you for being patient - trying to work through a new library
with an ugly cold didn't help the process.

I appreciate your time very much and thank you for developing this.

- Greg

On Nov 19, 5:48 pm, andrew cooke <and...@acooke.org> wrote:
> ok, so slow down and think a bit.  this isn't what i described.
> please read my email again.  also, you are still using Token('.').
>
> andrew
>
> 2009/11/19 LogicProgrammer <logicprogram...@gmail.com>:
> >> seehttp://www.acooke.org/lepl/lexer.html#usehalfway down where i
> >> >> > >> >    return...
>
> read more »
Reply all
Reply to author
Forward
0 new messages