http://code.google.com/p/json-pattern/
JSON Template maps a dictionary -> string; JSON Pattern goes the other
way, string -> dictionary. Roughly, it lets you annotate a big regular
expression with a (JSON) tree structure.
There are a bunch of things that need to be polished -- the main issue
is making the syntax as readable as possible. I've gone through about
3 iterations there, with some more tweaks to make. Suggestions
welcome.
Simple example of parsing "ls -al" (subpattern definitions omitted
here, the next 2 links have them):
http://chubot.org/json-pattern/test-cases/testLs_NewSyntax.html
Mini tutorial that explains the parts:
http://chubot.org/json-pattern/test-cases/testMiniTutorial.html
Parsing a big Perforce change description (from Google's open source
work; scroll to the end for the big pattern, and a nice hierarchical
structure):
http://chubot.org/json-pattern/test-cases/testFullChangeDesc_NewSyntax.html
Summary
* Like JSON Template, it's meant to be a language-independent specification
* Can be built on top of any regex engine, particularly
JavaScript's relatively weak one
* API is data, rather than a procedural API
* ~1000 lines of code, so it can be ported easily, but still powerful
* A well-defined (and fast) execution model
* Readable syntax (still improving here). Regular expressions are
very powerful, but hobbled by their obscure and inconsistent syntax.
* A small number of orthogonal concepts
  * Blocks (e.g. for expressing repeated capture)
  * Filters (extensible through host language)
  * Subpatterns (a pattern reuse mechanism)
* Composes with other components
  * The interpreter implements a binary operator (I think of it like ~=)
  * You can easily imagine a pipeline of text -> JSON Pattern ->
    structured data manipulation -> JSON Template -> text
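To make the pipeline idea concrete, here is a rough Python sketch of its
shape. It does not use JSON Pattern or JSON Template: the stdlib `re` module
stands in for the parsing step and `string.Template` for the rendering step,
and the function names (parse, transform, render) and the input format are
made up for illustration.

```python
import re
from string import Template


def parse(text):
    """text -> data (the role JSON Pattern would play)."""
    pairs = re.findall(r"^(\w+)\s+(\d+)%$", text, re.MULTILINE)
    return {"disks": [{"name": name, "used": int(used)} for name, used in pairs]}


def transform(data, threshold):
    """data -> data: keep only the disks at or above the threshold."""
    return {"disks": [d for d in data["disks"] if d["used"] >= threshold]}


def render(data):
    """data -> text (the role JSON Template would play)."""
    row = Template("$name is $used% full")
    return "\n".join(row.substitute(d) for d in data["disks"])


report = render(transform(parse("sda1 91%\nsdb1 40%\nsdc1 85%"), 80))
print(report)
```

The middle step only ever sees plain dictionaries and lists, which is what
lets the parsing and rendering ends be swapped out independently.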
What does this add over regular expressions?
* The ability to capture named, hierarchical data structures. Regular
expressions can only capture flat data, and in some engines, like
JavaScript's, the captures can't be named.
* Can capture integers and booleans, not just strings, via filters.
* Reuse of regular expressions. This is fairly common in practice,
e.g. when writing ad hoc lexers.
* More readable syntax, using line prefixes.
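For comparison, here is roughly what those points look like when done by hand
with Python's `re` module. This is not JSON Pattern syntax, just a sketch of
the same ideas: small reusable expressions, named captures, and "filters"
that turn captured strings into integers and booleans. The names ENTRY,
FILTERS and parse_listing are invented for this example, and the line format
is a simplified guess at "ls -al" output.

```python
import re

# Small named pieces, reused below (a stand-in for the subpattern idea).
PERMS = r"[-dl][-rwx]{9}"
NUMBER = r"\d+"
WORD = r"\S+"

ENTRY = re.compile(
    r"(?P<perms>" + PERMS + r")\s+"
    r"(?P<links>" + NUMBER + r")\s+"
    r"(?P<owner>" + WORD + r")\s+"
    r"(?P<group>" + WORD + r")\s+"
    r"(?P<size>" + NUMBER + r")\s+"
    r"(?P<date>\w+\s+\d+\s+[\d:]+)\s+"
    r"(?P<name>.+)"
)

# "Filters": conversions applied to captured strings.
FILTERS = {"links": int, "size": int}


def parse_listing(text):
    """Return {"entries": [...]} with one dictionary per matching line."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m is None:
            continue  # skip lines such as the "total 8" header
        record = m.groupdict()
        for field, convert in FILTERS.items():
            record[field] = convert(record[field])
        record["is_dir"] = record["perms"].startswith("d")
        entries.append(record)
    return {"entries": entries}


listing = """total 8
drwxr-xr-x 2 andy users 4096 Jun  1 12:00 docs
-rw-r--r-- 1 andy users  120 Jun  1 12:01 notes.txt"""
result = parse_listing(listing)
```

Note that the loop, the conversions and the nesting all live in host-language
code here; the point of the list above is that a pattern language can express
them as data instead.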
Applications
* Exposing system stats from command line tools over the network, e.g.
web services for system administration
* Quick and dirty parsing of some network formats, like DNS, HTTP headers, etc.
* Parsing little languages like *itself* and JSON Template. This
should be possible, since there are no operators with precedence and
such.
Caveats
* In most cases you wouldn't use this for HTML scraping. For HTML
scraping, you want something that knows about the tree structure of
the document, like jQuery's selector language.
TODO
* Allow filters to stop the match by returning None
* Subpatterns can also be filters, for structure refinement! (both
are functions from string -> JSON). Like the "Templates as
Formatters" idea, this language turned out to be unexpectedly rich.
* Perhaps allow hooks for executing procedural code, not just filters
(Perl does this in a messy way).
* Embedding a library of patterns in "JSON Config"
* Need lots of docs!
* Code cleanup, test cleanup
A bit late to respond, but thanks for this new interesting idea!
Regards,
Martijn
Thanks. There are a couple more interesting examples I came up with.
The first is parsing and evaluating a C-style \-escaped string (used
in JSON, Python, Protocol Buffers, etc.):
http://code.google.com/p/json-pattern/source/browse/tests/examples_test.py
There's a bit of unnecessary junk in that test, but the idea is that
you can split it up into manageable expressions rather than write a
huge regex, and you can also evaluate the escapes inline with filters.
Compare with the Python stdlib: tokenize.py uses the "huge ungainly
regex" approach.
It might be faster too -- I have to time it -- because alternation in
regexes (|) can cause unnecessary backtracking, and in this case you
want to just look forward. As noted in the Friedl book on regexes,
sometimes it's faster to split them up.
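To illustrate the "split it up and only look forward" idea, here is a small
sketch in plain Python `re`. It is not the code from examples_test.py and not
JSON Pattern syntax; the names PLAIN, ESCAPE and unescape are made up, and it
only handles the simple one-character escapes (no \uXXXX or octal).

```python
import re

PLAIN = re.compile(r'[^"\\]+')           # a run of ordinary characters
ESCAPE = re.compile(r'\\(["\\/bfnrt])')  # one simple backslash escape

# What each escape letter evaluates to (the "filter" step).
ESCAPE_VALUES = {
    '"': '"', '\\': '\\', '/': '/',
    'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t',
}


def unescape(body):
    """Evaluate the text between the quotes of a C-style string."""
    parts = []
    pos = 0
    while pos < len(body):
        # Try each small pattern at the current position; never go back.
        m = PLAIN.match(body, pos)
        if m:
            parts.append(m.group(0))
        else:
            m = ESCAPE.match(body, pos)
            if m is None:
                raise ValueError("bad escape at offset %d" % pos)
            parts.append(ESCAPE_VALUES[m.group(1)])
        pos = m.end()
    return "".join(parts)
```

Each pattern is small enough to read at a glance, and the escapes are
evaluated as they are matched rather than in a second pass.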
The other one is more of a toy example, an RPN calculator:
http://code.google.com/p/json-pattern/source/browse/tests/rpn_test.py
I compared that with this:
http://daniel.carrera.bz/2009/06/rpn-calculator-in-perl-6/
I think it's a lot more straightforward and elegant, but neither
example is really all that impressive compared to the most
straightforward possible Python implementation, which is like 10 lines
: )
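For reference, the straightforward version looks something like this. It is a
sketch written for this thread, not the code in rpn_test.py, and it assumes
whitespace-separated tokens and the four basic operators.

```python
import operator

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def rpn(expr):
    """Evaluate a whitespace-separated RPN expression, e.g. "3 4 +"."""
    stack = []
    for token in expr.split():
        if token in OPS:
            right = stack.pop()  # the top of the stack is the right operand
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed expression: %r" % expr)
    return stack[0]
```
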
Andy