xpath grammar - left recursion

372 views
Skip to first unread message

CG

unread,
Jul 23, 2010, 3:03:13 AM7/23/10
to PEG.js: Parser Generator for JavaScript
Hi all
I discovered PEG.js while looking for ways to parse XPath expressions
and it looks promising.

The problem, I am encountering at this stage is that the grammar
processor is detecting recursions in the grammar (the first on is on
RelativeLocationPath).

Is there a way to handle this ? I think the grammar is correctly
transcripted, but I am totally new to PEG in general, and might miss
something trivial.

Cheers and thanks
Christophe

//2 Location Paths
LocationPath
= RelativeLocationPath
/AbsoluteLocationPath

AbsoluteLocationPath
= '/' RelativeLocationPath?
/AbbreviatedAbsoluteLocationPath
RelativeLocationPath
= Step
/ RelativeLocationPath '/' Step
/ AbbreviatedRelativeLocationPath
//2.1 Location Steps
Step
= AxisSpecifier NodeTest Predicate*
/ AbbreviatedStep
AxisSpecifier
= AxisName '::'
/ AbbreviatedAxisSpecifier

//2.2 Axes
AxisName
= 'ancestor'
/ 'ancestor-or-self'
/ 'attribute'
/ 'child'
/ 'descendant'
/ 'descendant-or-self'
/ 'following'
/ 'following-sibling'
/ 'namespace'
/ 'parent'
/ 'preceding'
/ 'preceding-sibling'
/ 'self'
NodeTest
= NameTest
/ NodeType '(' ')'
// 'processing-instruction' '(' Literal ')'
//2.4 Predicates
Predicate
= '[' PredicateExpr ']'
PredicateExpr
= Expr
//2.5 Abbreviated Syntax
AbbreviatedAbsoluteLocationPath
= '//' RelativeLocationPath
AbbreviatedRelativeLocationPath
= RelativeLocationPath '//' Step
AbbreviatedStep
= '.'
/ '..'
AbbreviatedAxisSpecifier
= '@'?
// 3 Expressions
//3.1 Basics
Expr
= OrExpr
PrimaryExpr
= //VariableReference
'(' Expr ')'
/ Literal
/ Number
/ FunctionCall

//3.2 Function Calls
FunctionCall
= FunctionName '(' ( Argument ( ',' Argument )* )? ')'
Argument
= Expr

//3.3 Node-sets
UnionExpr
= PathExpr
/ UnionExpr '|' PathExpr
PathExpr
= LocationPath
/ FilterExpr
/ FilterExpr '/' RelativeLocationPath
/ FilterExpr '//' RelativeLocationPath
FilterExpr
= PrimaryExpr
/ FilterExpr Predicate

//3.4 Booleans
OrExpr = AndExpr
/ OrExpr 'or' AndExpr
AndExpr
= EqualityExpr
/ AndExpr 'and' EqualityExpr
EqualityExpr
= RelationalExpr
/ EqualityExpr '=' RelationalExpr
/ EqualityExpr '!=' RelationalExpr
RelationalExpr
= AdditiveExpr
/ RelationalExpr '<' AdditiveExpr
/ RelationalExpr '>' AdditiveExpr
/ RelationalExpr '<=' AdditiveExpr
/ RelationalExpr '>=' AdditiveExpr
//3.5 Numbers
AdditiveExpr
= MultiplicativeExpr
/ AdditiveExpr '+' MultiplicativeExpr
/ AdditiveExpr '-' MultiplicativeExpr
MultiplicativeExpr
= UnaryExpr
/ MultiplicativeExpr MultiplyOperator UnaryExpr
/ MultiplicativeExpr 'div' UnaryExpr
/ MultiplicativeExpr 'mod' UnaryExpr
UnaryExpr
= UnionExpr
/ '-' UnaryExpr

//3.7 Lexical Structure
ExprToken
= '(' / ')' / '[' / ']' / '.' / '..' / '@' / ',' / '::'
/ NameTest
/ NodeType
/ Operator _
/ FunctionName
/ AxisName
/ Literal _
/ Number
// VariableReference
Literal
= '"' [^"]* '"'
/ "'" [^']* "'"
Number
= Digits ('.' Digits?)?
/ '.' Digits
Digits
= [0-9]+
Operator
= OperatorName
/ MultiplyOperator
/ '/' / '//' / '|' / '+' / '-' / '=' / '!=' / '<' / '<=' / '>' /
'>='
OperatorName
= 'and' / 'or' / 'mod' / 'div'
MultiplyOperator
= '*'
//FunctionName
// = QName NodeType

//VariableReference
// = '$' QName */

NameTest
= '*'
// NCName ':' '*'
/ QName
QName
= Name

NameStartChar
= ":" / [A-Z] / "_" / [a-z] / [u00C0-u00D6] / [u00D8-u00F6] /
[u00F8-u02FF] / [u0370-u037D] / [u037F-u1FFF] / [u200C-u200D] / [u2070-
u218F] / [u2C00-u2FEF] / [u3001-uD7FF] / [uF900-uFDCF] / [uFDF0-
uFFFD] / [u10000-uEFFFF]
NameChar
= NameStartChar / "-" / "." / [0-9] / [u00B7] / [u0300-u036F] /
[u203F-u2040]
Name
= NameStartChar (NameChar)*

NodeType
= 'comment'
/ 'text'
// 'processing-instruction'
/ 'node'

//4 Core Function Library (XPATH1)

FunctionName
= XPath1CoreFunctions
/XFormsCoreFunctions
/XPath1CoreFunctions

XPath1CoreFunctions
= NodeSetFunctions
/StringFunctions
/BooleanFunctions
/NumberFunctions

//4.1 Node Set Functions
NodeSetFunctions
= "last"
/"position"
/"count"
/"id"
/"local-name"
/"namespace-uri"
/"name"

//4.2 String Functions
StringFunctions
= "string"
/"concat"
/"starts-with"
/"contains"
/"substring-before"
/"substring-after"
/"substring"
/"string-length"
/"normalize-space"
/"translate"

//4.3 Boolean Functions
BooleanFunctions
= "boolean"
/"not"
/"true"
/"false"
/"lang"

//4.4 Number Functions
NumberFunctions
= "number"
/"sum"
/"floor"
/"ceiling"
/"round"


//5 XForms Core Functions
XFormsCoreFunctions
= XFormsFunctionsChangeContext
/XFormsFunctions

XFormsFunctionsChangeContext
="instance"

XFormsFunctions
="avg"
/"boolean-from-string"
/"count-non-empty"
/"days-from-date"
/"if"
/"index"
/"max"
/"min"
/"months"
/"now"
/"property"
/"seconds"
/"seconds-from-dateTime"




//Common Syntactic Constructs
_ =[ \t\n\r]+

David Majda

unread,
Jul 24, 2010, 7:39:16 AM7/24/10
to pe...@googlegroups.com
Hi,

2010/7/23 CG <christop...@gmail.com>:


> The problem, I am encountering at this stage is that the grammar
> processor is detecting recursions in the grammar (the first on is on
> RelativeLocationPath).
>
> Is there a way to handle this ? I think the grammar is correctly
> transcripted, but I am totally new to PEG in general, and might miss
> something trivial.

the parser generated by PEG.js has the semantics of a recursive
descent parser. This means that if you used left recursive rules such
as "A = 'a' / A 'a'" in the grammar, the parser would end up in an
infinite cycle -- it would try to match rule A, but for doing that, it
would need to match rule A again, etc. The recursion can span multiple
rules, in which case it is called indirect left recursion. You can
read more about these things on Wikipedia [1] [2] [3].

Because a parser built from a left recursive grammar can end up in an
infinite cycle, PEG.js checks the grammar for both direct and indirect
recursion and returns an error if it finds such cases.

It is possible to mechanically convert any left-recursive grammar to a
right-recursive one. The theoretical algorithm is little bit tricky,
but often it is not needed in its full power -- in most cases, the
left recursion is used to encode arithmetic expressions or lists in
the grammar, and both can be usually rewritten easily. Moreover, with
PEG repetition operators, you can often avoid recursion at all.

For example, consider the RelativeLocationPath rule you mentioned:

RelativeLocationPath
= Step
/ RelativeLocationPath '/' Step
/ AbbreviatedRelativeLocationPath

Together with

AbbreviatedRelativeLocationPath
= RelativeLocationPath '//' Step

it basically says that RelativeLocationPath is a list of Steps
separated by "/" or "//". Once you have deciphered this meaning, the
rule can be easily rewritten into the following one:

RelativeLocationPath
= Step (("/" / "//") Step)*

Simple, isn't it?

Most grammars in various specifications are written for LR parsers and
use left recursion extensively, so certain amount of rewriting is
usually needed before using such a grammar in PEG based tools. To see
more examples of these rewrites, look into the example JavaScript
grammar in PEG.js [4] and compare it with the original grammar in the
ECMAScript specification [5].

David

[1] http://en.wikipedia.org/wiki/Recursive_descent_parser
[2] http://en.wikipedia.org/wiki/Left_recursion
[3] http://en.wikipedia.org/wiki/Parsing_expression_grammar
[4] http://github.com/dmajda/pegjs/blob/master/examples/javascript.pegjs
[5] http://www.ecma-international.org/publications/standards/Ecma-262.htm

--
Everyone gets everything he wants. --Captain Willard in Apocalypse Now

Personal :: da...@majda.cz              :: www.majda.cz
Work     :: david...@impaladesign.cz :: www.impaladesign.cz

CG

unread,
Jul 25, 2010, 8:58:27 AM7/25/10
to PEG.js: Parser Generator for JavaScript
Dear David
Thanks a lot for your expert time and detailed explanations. It is
very useful.

So if I understand correctly, it is possible in principle to rewrite
the XPath grammar so as to avoid recursions. I will work on that ; )

Just one question re performance of peg.js parser. Would would think/
guess that parsing about one hundred of expressions like with the
below grammar:

instance('data')/main/indicator/plans/parameters/useCategory

could be realistically processed at run-time (e.g. below one second) ?

Cheers
C.

On Jul 24, 1:39 pm, David Majda <da...@majda.cz> wrote:
> Hi,
>
> 2010/7/23 CG <christophe.gei...@gmail.com>:
> Work     :: david.ma...@impaladesign.cz ::www.impaladesign.cz

David Majda

unread,
Aug 2, 2010, 2:56:48 AM8/2/10
to pe...@googlegroups.com
Hi,

sorry for a late answer.

2010/7/25 CG <christop...@gmail.com>:


> Just one question re performance of peg.js parser. Would would think/
> guess that parsing about one hundred of expressions like with the
> below grammar:
>
> instance('data')/main/indicator/plans/parameters/useCategory
>
> could be realistically processed at run-time (e.g. below one second) ?

you are talking about parsing speed around 5-10 kB/s. With most
grammars, PEG.js should be able to achieve that speed without
problems.

You can run the benchmark included in PEG.js [1] on your machine to
get an idea how fast PEG.js is. On my machine, it runs at ~ 30 kB/s
(but at ~ 100 kB/s for parsing CSS only). This is all on V8.

I want to focus more on performance in PEG.js 0.6, so stay tuned.

David Majda

[1] http://github.com/dmajda/pegjs/tree/master/benchmark/

--
Everyone gets everything he wants. --Captain Willard in Apocalypse Now

Personal :: da...@majda.cz              :: www.majda.cz
Work     :: david...@impaladesign.cz :: www.impaladesign.cz

Reply all
Reply to author
Forward
0 new messages