PEG Parser


Abe Schneider

May 25, 2014, 10:28:45 PM
to julia...@googlegroups.com
I wrote a quick PEG Parser for Julia with Packrat capabilities:

https://github.com/abeschneider/PEGParser

It's a first draft and needs a ton of work, testing, etc., but if this is of interest to anyone else, here is a quick description.

Grammars can be defined using most of the standard EBNF syntax. For example, a simple math grammar can be defined as:

@grammar mathgrammar begin
  start = expr
  number = r"([0-9]+)"
  expr = (term + op1 + expr) | term
  term = (factor + op2 + term) | factor
  factor = number | pfactor
  pfactor = ('(' + expr + ')')
  op1 = '+' | '-'
  op2 = '*' | '/'
end


To parse a string with the grammar:

(node, pos, error) = parse(mathgrammar, "5*(2-6)")
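As a sanity check on the value this input should produce, here is the same string run through Julia's own parser and evaluator (in the 0.3-era Julia of this thread that was `eval(parse(...))`; `Meta.parse` in later versions):

```julia
# Evaluate the same arithmetic input with Julia's built-in parser,
# for comparison with what the PEG transform should produce.
ex = Meta.parse("5*(2-6)")   # :(5 * (2 - 6))
result = eval(ex)
println(result)              # -20
```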

This will create an AST which can then be transformed to a value. Currently this is accomplished by doing:

math = Dict()
math["number"] = (node, children) -> float(node.value)
math["expr"] = (node, children) ->
   length(children) == 1 ? children : eval(Expr(:call, children[2], children[1], children[3]))
math["factor"] = (node, children) -> children
math["pfactor"] = (node, children) -> children[2]
math["term"] = (node, children) ->
   length(children) == 1 ? children : eval(Expr(:call, children[2], children[1], children[3]))
math["op1"] = (node, children) -> symbol(node.value)
math["op2"] = (node, children) -> symbol(node.value)

Ideally, I would like to simplify this to using multi-dispatch on symbols (see previous post), but for now this is the easiest way to define actions based on node attributes.

Finally, to transform the tree:

result = transform(math, node)  # will give the value of -20
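The "expr" and "term" actions above build a call expression from the operator symbol and the two already-transformed operand values, then evaluate it. That mechanism in isolation:

```julia
# Build a call expression from an operator symbol (as the op1/op2
# actions return) and two operand values, then evaluate it -- the
# same trick the "expr" and "term" actions use.
op = :*
lhs, rhs = 5.0, -4.0
ex = Expr(:call, op, lhs, rhs)   # :(5.0 * -4.0)
println(eval(ex))                # -20.0
```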

Originally I was going to attach the transforms to the rules themselves (similar to boost::spirit). However, there were two reasons for not doing this:
  1. To implement the packrat part of the parser, I needed to cache the results, which meant building an AST anyway
  2. It's nice to be able to apply different transforms to the same grammar (e.g. you may want to transform the result into HTML, LaTeX, etc.)

The downside of the separation is that it adds some more complexity to the process.

harven

May 26, 2014, 3:41:26 AM
to julia...@googlegroups.com
Nice!

If you are interested in testing your library on a concrete problem, you may want to parse comma-separated value (CSV) files. The BNF is in the specification, RFC 4180. http://tools.ietf.org/html/rfc4180

AFAIK, the readcsv function provided in Base does not handle quoting well, whereas the CSV parser in DataFrames is slow, so Julia does not yet have an efficient native way to parse CSV files.

Andreas Lobinger

May 26, 2014, 5:16:18 AM
to julia...@googlegroups.com
Hello colleagues,

Just as I read it:


On Monday, May 26, 2014 9:41:26 AM UTC+2, harven wrote:
> If you are interested in testing your library on a concrete problem, you may want to parse comma-separated value (CSV) files. The BNF is in the specification, RFC 4180. http://tools.ietf.org/html/rfc4180

Miles Gould

May 26, 2014, 4:35:52 PM
to julia...@googlegroups.com
Nice indeed! We'd also welcome a parser for the dot format in Graphs.jl :-)

https://en.wikipedia.org/wiki/DOT_%28graph_description_language%29

Miles

John Myles White

May 26, 2014, 7:44:37 PM
to julia...@googlegroups.com
I’d be interested to see how far RFC4180 actually gets you. My guess is that it doesn’t get you very far.

 — John

Abe Schneider

May 27, 2014, 6:49:57 PM
to julia...@googlegroups.com
I don't know how the speed of the parser compares to DataFrames -- I've done absolutely no profiling to date -- but I thought writing a CSV parser was a good way to test out the code (and it helped find a bunch of bugs).

I've also committed the CSV parser (under examples/). The grammar (from the RFC) is:

@grammar csv begin
  start = data
  data = record + *(crlf + record)
  record = field + *(comma + field)
  field = escaped_field | unescaped_field
  escaped_field = dquote + *(textdata | comma | cr | lf | dquote2) + dquote
  unescaped_field = textdata
  textdata = r"[ !#$%&'()*+\-./0-~]+"
  cr = '\r'
  lf = '\n'
  crlf = cr + lf
  dquote = '"'
  dquote2 = "\"\""
  comma = ','
end
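The textdata character class does the real tokenizing work here: it covers the printable characters but excludes the comma, the double quote, CR, and LF, so a match stops at the first field or record boundary. Checked in isolation:

```julia
# The textdata regex from the grammar above: the character class omits
# ',' , '"' , '\r', and '\n', so matching stops at a field boundary.
textdata = r"[ !#$%&'()*+\-./0-~]+"
m = match(textdata, "this is,a test")
println(m.match)   # "this is"
```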

and the actions are:

tr = Dict()
tr["crlf"] = (node, children) -> nothing
tr["comma"] = (node, children) -> nothing
tr["escaped_field"] = (node, children) -> node.children[2].value
tr["unescaped_field"] = (node, children) -> node.children[1].value
tr["field"] = (node, children) -> children
tr["record"] = (node, children) -> unroll(children)
tr["data"] = (node, children) -> unroll(children)
tr["textdata"] = (node, children) -> node.value


Given the data:

parse_data = """1,2,3\r\nthis is,a test,of csv\r\n"these","are","quotes ("")""""

and running the parser:

(node, pos, error) = parse(csv, parse_data)
result = transform(tr, node)

I get:

{{"1","2","3"},{"this is","a test","of csv"},{"these","are","quotes (\"\")"}}

Abe Schneider

May 27, 2014, 6:53:23 PM
to julia...@googlegroups.com
I'll take a look. I found the full grammar here:

http://www.graphviz.org/doc/info/lang.html

Unfortunately I have very little free time (most of the work was done during vacation), but it shouldn't be too difficult to implement.

John Myles White

May 27, 2014, 6:58:42 PM
to julia...@googlegroups.com
I'd be really interested to see how this parser compares with DataFrames. There's a bunch of test files in the DataFrames.jl/test directory.

 -- John

Jameson Nash

May 27, 2014, 7:49:06 PM
to julia...@googlegroups.com
some possible alternatives to using eval include:
macroexpand
apply
getfield

depending on your use-case

Jameson Nash

May 27, 2014, 8:26:39 PM
to julia...@googlegroups.com
From your package readme:

> Finally, one thing that I would like to change in the near future is to have transforms look something like:
>
>   html(node, children, :bold_open) = "<b>"
>   html(node, children, :bold_close) = "</b>"
>   html(node, children, :text) = node.value
>   html(node, children, :bold_text) = join(children)
>   result = transform(html, node)
>
> If Julia gets dispatch on value, then this would be trivial to write. One possible workaround is to create a type per rule in the grammar. Then the functions can be written to dispatch on the type associated with the given rule.

That workaround might look like:

type Node{T} end
html{T}(node, children, ::Node{T}) = error("No rule to make $T into html")
html(node, children, ::Node{ :bold_open }) = "<b>"
html(node, children, ::Node{ :bold_close }) = "</b>"

Abe Schneider

Jun 4, 2014, 11:01:33 PM
to julia...@googlegroups.com
After playing around with a bunch of alternatives, I think I've come up with decent action semantics:

@transform <name> begin
  <label> = <action>
end

For example, a simple graph grammar might look like:

@grammar nodetest begin
  start = +node_def
  node_def = node_label + node_name + lbrace + data + rbrace
  node_name = string_value + space

  data = *(line + semicolon)
  line = string_value + space
  string_value = r"[_a-zA-Z][_a-zA-Z0-9]*"

  lbrace = "{" + space
  rbrace = "}" + space
  semicolon = ";" + space
  node_label = "node" + space
  space = r"[ \t\n]*"
end

with its actions to create a data structure:

type MyNode
  name
  values

  function MyNode(name, values)
    new(name, values)
  end
end


with:
@transform tograph begin
  # ignore these
  lbrace = nothing
  rbrace = nothing
  semicolon = nothing
  node_label = nothing
  space = nothing

  # special action so we don't have to define every label
  default = children

  string_value = node.value
  value = node.value
  line = children
  data = MyNode("", children)
  node_def = begin
    local name = children[1]
    local cnode = children[2]
    cnode.name = name
    return cnode
  end
end

and finally, to apply the transform:

(ast, pos, error) = parse(nodetest, data)
result = apply(tograph, ast)
println(result)    # {MyNode("foo",{"a","b"}),MyNode("bar",{"c","d"})}

The magic in '@transform' basically just creates the dictionary as before, but automatically wraps the expression on the RHS as an anonymous function: (node, children) -> expr.

I'm currently looking for a better name than 'children', as it's potentially confusing and misleading. It's actually the values of the child nodes (as opposed to node.children). Maybe cvalues?
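If the macro just builds the dictionary, its expansion is presumably equivalent to the hand-written form from earlier in the thread. A sketch of that equivalence (the entry names come from the grammar above; the exact expansion is my assumption, not PEGParser's actual output):

```julia
# Hypothetical sketch of what @transform tograph expands to: a Dict
# whose values are the RHS expressions wrapped as (node, children)
# anonymous functions.
tograph = Dict()
tograph["lbrace"]       = (node, children) -> nothing
tograph["default"]      = (node, children) -> children
tograph["string_value"] = (node, children) -> node.value
println(tograph["default"](nothing, ["a", "b"]))   # ["a", "b"]
```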

Abe Schneider

Jun 4, 2014, 11:03:31 PM
to julia...@googlegroups.com
I'll try to get around to comparing against the DataFrames version and profiling this week. I got stuck trying to figure out the action semantics.

Abe Schneider

Jun 5, 2014, 6:56:37 AM
to julia...@googlegroups.com
I also forgot to push the changes last night.

Abe Schneider

Jul 4, 2014, 2:09:26 PM
to julia...@googlegroups.com
I got sidetracked by a couple of other things, but the parser is now updated with a bunch of bug fixes. I have a preliminary CSV and graphdot parser (very reduced from the full grammar). I'm starting to put together some more comprehensive tests.

As for speed compared to DataFrames for parsing CSV: it's much slower. I've spent time trying to optimize things, but I suspect a large part of the speed issue is the overhead of function calls. I also suspect it will be hard to come close to the speed of DataFrames, as its code looks fairly optimized for reading just CSV files.

After a few more tests are written, I'm getting ready to officially tag a version 0.1.

John Myles White

Jul 6, 2014, 9:08:40 PM
to julia...@googlegroups.com
Thanks for looking into this, Abe. That’s too bad that the CSV parser is much slower than the hand-crafted one. PEG seems like a great tool for tasks where maximum performance isn’t as important.

 — John

Abe Schneider

Jul 13, 2014, 10:27:22 AM
to julia...@googlegroups.com
I agree; I think PEG grammars are likely not well-suited to parsing large amounts of data (especially where a hand-crafted parser can be written). That said, I am curious why the speed difference exists. In the case of CSV, I think the packrat caching mechanism probably doesn't help, but disabling it still leaves the parser slow.
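For reference, the packrat mechanism mentioned here memoizes each (rule, position) pair so a rule body is evaluated at most once per input position. A minimal standalone sketch of the idea (not PEGParser's actual implementation):

```julia
# Minimal packrat-style memoization sketch: cache results keyed on
# (rule, position) so each rule body runs at most once per position.
cache = Dict()
function cached(rule, pos, produce)
    key = (rule, pos)
    haskey(cache, key) && return cache[key]
    cache[key] = produce()
end

calls = [0]   # count how many times the rule body actually runs
produce_field() = (calls[1] += 1; "some-match")
cached(:field, 3, produce_field)
cached(:field, 3, produce_field)   # served from the cache
println(calls[1])                  # 1
```

For a grammar like CSV with little backtracking, this cache adds bookkeeping without saving much work, which is consistent with disabling it not helping.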

The next logical suspect for the slowness is the number of function calls required. I know that multiple dispatch carries some added penalty over single dispatch, but I'm not sure by how much. It's possible that I could rewrite the CSV grammar to make fewer calls, either by restructuring some of the rules or by using more regular expressions.

Finally, the last likely source of the slowdown is that PEGParser has to build the AST as it goes. That is a large overhead the DataFrames CSV parser avoids, since it's a single-purpose parser.

I'm going to continue work on the Dot parser as I think it's probably more useful (and more difficult to write by hand).

Also, I've created a pull request to add PEGParser to the official package list. If it's of interest, it would be great to have other people involved.
