The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/
3. Enjoy!
Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
There has been a lot of talk recently about parsing with Ruby. We're seeing
some parser generator libraries pop up that make the task that much easier and
they've been stirring up interest.
In honor of that, this week's Ruby Quiz is to write a parser for JSON.
JSON turns out to be a great little example for writing parsers, for two
reasons. First, it's pretty easy stuff. You can hand-roll a JSON parser in
under 100 lines of Ruby. The second advantage is that the data format is
wonderfully documented:

http://json.org/
Since JSON is just a data format and Ruby supports all of the data types, I vote
we just use Ruby itself as the abstract syntax tree produced by the parse.
Feel free to show off your favorite parser generator, if you don't want to roll
your own. Anything goes.
Here are a few tests to get you started:
require "test/unit"
class TestJSONParser < Test::Unit::TestCase
def setup
@parser = JSONParser.new
end
def test_keyword_parsing
assert_equal(true, @parser.parse("true"))
assert_equal(false, @parser.parse("false"))
assert_equal(nil, @parser.parse("null"))
end
def test_number_parsing
assert_equal(42, @parser.parse("42"))
assert_equal(-13, @parser.parse("-13"))
assert_equal(3.1415, @parser.parse("3.1415"))
assert_equal(-0.01, @parser.parse("-0.01"))
assert_equal(0.2e1, @parser.parse("0.2e1"))
assert_equal(0.2e+1, @parser.parse("0.2e+1"))
assert_equal(0.2e-1, @parser.parse("0.2e-1"))
assert_equal(0.2E1, @parser.parse("0.2e1"))
end
def test_string_parsing
assert_equal(String.new, @parser.parse(%Q{""}))
assert_equal("JSON", @parser.parse(%Q{"JSON"}))
assert_equal( %Q{nested "quotes"},
@parser.parse('"nested \"quotes\""') )
assert_equal("\n", @parser.parse(%Q{"\\n"}))
assert_equal( "a",
@parser.parse(%Q{"\\u#{"%04X" % ?a}"}) )
end
def test_array_parsing
assert_equal(Array.new, @parser.parse(%Q{[]}))
assert_equal( ["JSON", 3.1415, true],
@parser.parse(%Q{["JSON", 3.1415, true]}) )
assert_equal([1, [2, [3]]], @parser.parse(%Q{[1, [2, [3]]]}))
end
def test_object_parsing
assert_equal(Hash.new, @parser.parse(%Q{{}}))
assert_equal( {"JSON" => 3.1415, "data" => true},
@parser.parse(%Q{{"JSON": 3.1415, "data": true}}) )
assert_equal( { "Array" => [1, 2, 3],
"Object" => {"nested" => "objects"} },
@parser.parse(<<-END_OBJECT) )
{"Array": [1, 2, 3], "Object": {"nested": "objects"}}
END_OBJECT
end
def test_parse_errors
assert_raise(RuntimeError) { @parser.parse("{") }
assert_raise(RuntimeError) { @parser.parse(%q{{"key": true false}}) }
assert_raise(RuntimeError) { @parser.parse("[") }
assert_raise(RuntimeError) { @parser.parse("[1,,2]") }
assert_raise(RuntimeError) { @parser.parse(%Q{"}) }
assert_raise(RuntimeError) { @parser.parse(%Q{"\\i"}) }
assert_raise(RuntimeError) { @parser.parse("$1,000") }
assert_raise(RuntimeError) { @parser.parse("1_000") }
assert_raise(RuntimeError) { @parser.parse("1K") }
assert_raise(RuntimeError) { @parser.parse("unknown") }
end
end
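(To make the "under 100 lines" claim above concrete before the solutions start:
here is a minimal hand-rolled StringScanner sketch written for this summary; it
is not one of the thread's submissions. It accepts top-level values, as the
tests above do, and it punts on UTF-16 surrogate pairs.)

require 'strscan'

class JSONParser
  def parse(input)
    @s = StringScanner.new(input)
    value = parse_value
    @s.skip(/\s+/)
    error("trailing garbage") unless @s.eos?
    value
  end

  private

  def parse_value
    @s.skip(/\s+/)
    if    @s.scan(/\{/)      then parse_object
    elsif @s.scan(/\[/)      then parse_array
    elsif @s.scan(/true\b/)  then true
    elsif @s.scan(/false\b/) then false
    elsif @s.scan(/null\b/)  then nil
    elsif @s.check(/"/)      then parse_string
    elsif num = @s.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/)
      num[/[.eE]/] ? num.to_f : num.to_i   # float only if it needs to be
    else
      error("unexpected input")
    end
  end

  def parse_object
    hash = {}
    @s.skip(/\s+/)
    return hash if @s.scan(/\}/)
    loop do
      @s.skip(/\s+/)
      error("key expected")    unless @s.check(/"/)
      key = parse_string
      @s.skip(/\s+/)
      error("':' expected")    unless @s.scan(/:/)
      hash[key] = parse_value
      @s.skip(/\s+/)
      break if @s.scan(/\}/)
      error("',' expected")    unless @s.scan(/,/)
    end
    hash
  end

  def parse_array
    array = []
    @s.skip(/\s+/)
    return array if @s.scan(/\]/)
    loop do
      array << parse_value
      @s.skip(/\s+/)
      break if @s.scan(/\]/)
      error("',' expected") unless @s.scan(/,/)
    end
    array
  end

  ESCAPES = { 'b' => "\b", 'f' => "\f", 'n' => "\n", 'r' => "\r", 't' => "\t" }

  def parse_string
    @s.scan(/"/)
    result = ""
    until @s.scan(/"/)
      if @s.scan(/\\u([0-9a-fA-F]{4})/)          # no surrogate-pair handling
        result << [@s[1].hex].pack("U")
      elsif @s.scan(/\\(["\/\\bfnrt])/)
        result << (ESCAPES[@s[1]] || @s[1])
      elsif c = @s.scan(/[^\\"]/)
        result << c
      else
        error("bad string")
      end
    end
    result
  end

  def error(message)
    raise RuntimeError, "#{message} at #{@s.pos}"
  end
end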
On Feb 1, 2008 7:55 AM, Ruby Quiz <ja...@grayproductions.net> wrote:
> In honor of that, this week's Ruby Quiz is to write a parser for JSON.
>
> JSON turns out to be a great little example for writing parsers, for two
> reasons. First, it's pretty easy stuff. You can hand-roll a JSON parser in
> under 100 lines of Ruby. The second advantage is that the data format is
> wonderfully documented:
>
> http://json.org/
>
I definitely want to find time to do this one. What would be nice to have
is a performance benchmark to compare parsers. Maybe just have a little ruby
script that generates a stream of repeatable random (but valid) JSON.
Eric
Neat idea.
Just FYI though, I'm probably going to focus more on the parsing in
the summary than the speed.
James Edward Gray II
OK. Once I figure out exactly what JSON is, I'll probably make a random
JSON generator.

Hopefully this will make me do another release of my parser package, which
is long overdue :). At the least, I'll check my local code into CVS.
data = eval(json)
Or, safety levels notwithstanding, we could conceive a safe_eval(json).
T.
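(To make the safe_eval idea concrete: a minimal sketch written for this
summary, not code from the thread. It whitelists JSON-looking tokens before
eval and defuses string interpolation the way tho_mica_l's solution later
does; the token regexp and the naive global substitutions are the obvious
weak points.)

# Hypothetical safe_eval sketch: validate, convert, then eval.
JSON_TOKENS = /\A(?:"(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|
                   true|false|null|[\[\]{}:,]|\s)*\z/x

def safe_eval(json)
  raise RuntimeError, "unsafe input" unless json =~ JSON_TOKENS
  # Defuse "#{...}" interpolation, then rewrite ":" and "null".
  # (Naive: the last two gsubs would also rewrite colons and "null"
  # inside strings; a real solution has to tokenize first.)
  ruby = json.gsub(/#/, '\\\\#').gsub(/:/, '=>').gsub(/\bnull\b/, 'nil')
  eval(ruby)
end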
> A bit aside, but it seems a good place to plug the thought: JSON is so
> close to valid Ruby syntax. It would be great if Ruby could support
> the syntax 100%.
Conversion is pretty easy and definitely one way to solve this quiz.
James Edward Gray II
> A bit aside, but it seems a good place to plug the thought: JSON is so
> close to valid Ruby syntax. It would be great if Ruby could support
> the syntax 100%. Then a parse would be as simple as,
>
> data = eval(json)
Brilliant!
But perhaps the other way around: bridge the JSON syntax discrepancies
to valid Ruby syntax, e.g:
eval( to_ruby( json ) )
Cheers,
PA.
> Conversion is pretty easy and definitely one way to solve this quiz.
not to mention installing one of the three gem json parsers out
there ;-)
a @ http://codeforpeople.com/
--
it is not enough to be compassionate. you must act.
h.h. the 14th dalai lama
>
> On Feb 1, 2008, at 9:28 AM, James Gray wrote:
>
>> Conversion is pretty easy and definitely one way to solve this quiz.
>
> not to mention installing one of the three gem json parsers out
> there ;-)
That's so cheating Ara. :D
James Edward Gray II
> Maybe just have a little ruby
> script that generates a stream of repeatable random (but valid) JSON.
cfp2:~ > cat a.rb
require 'rubygems'
require 'json'
def random_json
case rand
when 0 ... 1/3.0
top = Hash.new
add = lambda{|obj| top[obj] = obj}
when 1/3.0 ... 2/3.0
top = Array.new
add = lambda{|obj| top.push obj}
when 2/3.0 .. 1
top = String.new
add = lambda{|obj| top += obj}
end
10.times{ add[rand.to_s] }
top.to_json
end
puts random_json
cfp2:~ > for i in `seq 1 3`;do ruby a.rb ;done
"0.3786779826911330.2475380034343990.7052927081471540.2056530009384740.1367079874315110.6433874613518640.5329060341883540.8932613322492760.9233991888762390.561470121133217
"
{"0.758942077040095":"0.758942077040095","0.740998718448961":"0.740998718448961","0.581975309640819":"0.581975309640819","0.471066491788047":"0.471066491788047","0.150752108985123":"0.150752108985123","0.679712508205116":"0.679712508205116","0.265444532310993":"0.265444532310993","0.43229805237576":"0.43229805237576","0.880407977937905":"0.880407977937905","0.91896885679168":"0.91896885679168"}
["0.140526101058637
","0.647296447390116
","0.419874655921874
","0.67320818546074
","0.847043108967541
","0.479385904117001
","0.378678170026127
","0.707315391952609","0.26064520446906","0.460184583302929"]
we can deny everything, except that we have the possibility of being
better. simply reflect on that.
I hoped ruby19 would already support this use of colons as in {"key":
"value"} but unfortunately not.
It's key: value
>> {key: "value"}
=> {:key => "value"}
HTH,
Sebastian
--
NP: Eisregen - Herzblut
Jabber: sep...@jabber.org
ICQ: 205544826
I see. Thanks for pointing this out.
Still some conversion needed for the quiz.
BTW, the ruby19 json library, which Ara mentioned, also has a parse
method which I suppose you, uhm, know of? I think the use of this
should be ... well, officially discouraged in the context of this
quiz, maybe. Since Ara has already mentioned this library, I hope this
isn't news for you.
> BTW, the ruby19 json library, which Ara mentioned, also has a parse
> method which I suppose you, uhm, know of?
I suspect the number of us using 1.9 exclusively is still pretty
small, so I don't focus too much on it when writing quizzes. I knew
it had a JSON library though, yes.
> I think the use of this should be ... well, officially discouraged
> in the context of this quiz, maybe.
I agree. It's cheating too. :)
James Edward Gray II
yeah - obviously cheating. just to clarify though, for people who
might actually want to use json: this is not a 1.9-only thing. it's on
rubyforge:
cfp2:~ > gem list --remote|grep json
fjson (0.1.2, 0.1.1, 0.1.0, 0.0.9, 0.0.8, 0.0.7, 0.0.6, 0.0.5, 0.0.4,
0.0.3, 0.0.2, 0.0.1)
json (1.1.2, 1.1.1, 1.1.0, 1.0.4, 1.0.3, 1.0.2, 1.0.1, 1.0.0, 0.4.3,
0.4.2, 0.4.1, 0.4.0)
json_pure (1.1.2, 1.1.1, 1.1.0, 1.0.4, 1.0.3, 1.0.2, 1.0.1, 1.0.0)
Orbjson (0.0.4, 0.0.3, 0.0.2, 0.0.1)
ruby-json (1.1.2, 1.1.1)
and activesupport also includes a parser
regards.
as does blow.
T.
The idea of giving some exposure to parsing (never really done, AFAIR, in a
Ruby Quiz) is a nice one!!!

Now a translation that would eval is indeed an interesting idea,
especially as I am almost sure that it would be parsing too.

AFAIK JSON can be read by JavaScript natively (surprisingly); what
about implementing JavaScript in Ruby ;)
Cheers
Robert
> I do not know why I have the impression that you want to make it easier
> for James to leave ;).
doh - didn't want to give *that* impression!
a @ http://drawohara.com/
--
sleep is the best meditation.
Hey guys
This is my first parser. I used Nathan Sobo's Treetop parsing library (
http://treetop.rubyforge.org/, gem install treetop):
require 'treetop'
require 'pp'

File.open("json.treetop", "w") {|f| f.write GRAMMAR }
Treetop.load "json"

parser = JsonParser.new
pp parser.parse(STDIN.read).value if $0 == __FILE__
BEGIN {
GRAMMAR = %q{
  grammar Json
    rule json
      space json_value space { def value; json_value.value; end }
    end

    rule json_value
      string / numeric / keyword / object / array
    end

    rule string
      '"' chars:char* '"' {
        def value
          chars.elements.map {|e| e.value }.join
        end
      }
    end

    rule char
      !'"' ('\\\\' ( ( [nbfrt"] / '\\\\' / '/' ) / 'u' hex hex hex hex ) / !'\\\\' .) {
        def value
          if text_value[0..0] == '\\\\'
            case c = text_value[1..1]
            when /[nbfrt]/
              {'n' => "\n", 'b' => "\b", 'f' => "\f", 'r' => "\r", 't' => "\t"}[c]
            when 'u'
              [text_value[2,4].to_i(16)].pack("L").gsub(/\0*$/,'')
            else
              c
            end
          else
            text_value
          end
        end
      }
    end

    rule hex
      [0-9a-fA-F]
    end

    rule numeric
      exp / float / integer
    end

    rule exp
      (float / integer) ('e' / 'E') ('+' / '-')? integer { def value; text_value.to_f; end }
    end

    rule float
      integer '.' [0-9]+ { def value; text_value.to_f; end }
    end

    rule integer
      '-'? ('0' / [1-9] [0-9]*) { def value; text_value.to_i; end }
    end

    rule keyword
      ('true' / 'false' / 'null') {
        def value
          { 'true' => true, 'false' => false, 'null' => nil }[text_value]
        end
      }
    end

    rule object
      '{' space pairs:pair* space '}' {
        def value
          pairs.elements.map {|p| p.value }.inject({}) {|h,p| h.merge p }
        end
      }
    end

    rule pair
      space string space ':' space json_value space (',' &pair / !pair) {
        def value
          { string.value => json_value.value }
        end
      }
    end

    rule array
      '[' space array_values:array_value* space ']' {
        def value
          array_values.elements.map {|e| e.value }
        end
      }
    end

    rule array_value
      space json_value space (',' &array_value / !array_value) {
        def value
          json_value.value
        end
      }
    end

    rule space
      [ \t\r\n]*
    end
  end
}
}
- steve
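(For anyone trying it: once the script above has written json.treetop and
loaded it, the generated parser can also be fed a string directly; note the
.value call, which steve points out again later in the thread.)

# Assumes steve's script above has already run Treetop.load "json".
parser = JsonParser.new
result = parser.parse('{"a": [1, 2.5, true, null]}')
p result.value   #=> {"a"=>[1, 2.5, true, nil]}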
Here is my solution. I do a first pass to tokenize the input and perform
basic syntax checks. Then the expression is fully converted into ruby syntax
and eval is used to load it into Ruby. It passes each of the test cases,
although some improvements could still be made.
class JSONParser
  # Parse a given JSON expression
  def parse(expr)
    # Tokenize the input
    tokens = lex(expr)

    # Load the expression into ruby.
    # Takes advantage of the fact that ruby syntax is so close to that of
    # JSON. However, it would be nice to have a safe_eval to prevent against
    # potential injection attacks.
    begin
      eval(ruby_convert(tokens))
    rescue SyntaxError, NameError
      raise RuntimeError
    end
  end

  # Converts tokens into a single ruby expression
  def ruby_convert(tokens)
    expr = ""
    for token in tokens
      token = "=>" if token == ":" # Ruby hash syntax
      token = "nil" if token == "null"
      expr += token
    end
    expr
  end

  # Parses the input expression into a series of tokens.
  # Performs some limited forms of conversion where necessary.
  def lex(expr)
    tokens = []
    i = -1
    while i < expr.size - 1
      tok ||= ""
      i += 1
      case expr[i].chr
      when '[', ']', '{', '}', ':', ','
        tokens << tok if tok.size > 0
        tokens << expr[i].chr
        tok = ""
      # String processing
      when '"'
        raise "Unexpected quote" if tok.size > 0
        len = 1
        escaped = false
        while (len + i) < expr.size
          break if expr[len + i].chr == '"' and not escaped
          if escaped
            case expr[len + i].chr
            when '"', '/', '\\', 'b', 'f', 'n', 'r', 't', 'u'
            else
              raise "Unable to escape #{expr[len + i].chr}"
            end
          end
          escaped = expr[len + i].chr == "\\"
          len += 1
        end
        raise "No matching endquote for string" if (len + i) > expr.size
        tokens << convert_unicode(expr.slice(i, len + 1))
        i += len
      # Number processing
      when '-', /[0-9]/
        len = 0
        while (len + i) < expr.size and /[0-9eE+-.]/.match(expr[len + i].chr) != nil
          len += 1
        end
        num = expr.slice(i, len)
        # Verify syntax of the number using the JSON state machine
        raise "Invalid number #{num}" if /[-]?([1-9]|(0\.))[0-9]*[eE]?[+-]?[0-9]*/.match(num) == nil
        tokens << num
        i += len - 1
      # Skip whitespace (note: "\t" must be double-quoted to mean a tab)
      when ' ', "\t"
      else
        tok << expr[i].chr
      end
    end
    tokens << tok if tok.size > 0
    tokens
  end

  # Convert unicode characters from hex (currently only handles ASCII set)
  def convert_unicode(str)
    while true
      u_idx = str.index(/\\u[0-9a-fA-F]{4}/)
      break if u_idx == nil
      u_str = str.slice(u_idx, 6)
      str.sub!(u_str, u_str[2..5].hex.chr)
    end
    str
  end
end
Thanks,
Justin
My solution is ruby19 only, since I used the opportunity to explore some
of the new regexp features. Since it uses eval(), there is a possibility
for ruby code injection like the ultimately relieving "#{`sudo rm -rf /`}".
I think my solution catches such attacks though.

BTW, what do you all think should be the canonic output of the following
JSON snippet:

json1 = <<JSON
{"a":2,"b":3.141,"TIME":"2007-03-14T11:52:40","c":"c","d":[1,"b",3.14],"COUNT":666,"e":{"foo":"bar"},"foo":"B\\u00e4r","g":"\\u677e\\u672c\\u884c\\u5f18","h":1000.0,"bar":"\\u00a9 \\u2260 \\u20ac!","i":0.001,"j":"\\ud840\\udc01"}
JSON

I get conflicting results between various versions of my solution and
the official ruby19 parser with respect to these utf characters. This
snippet is taken (IIRC) from the ruby-json parser.
Regards,
Thomas.
#!/usr/bin/env ruby19
# Author::  Thomas Link (micathom AT gmail com)
# Created:: 2008-02-01.
#
# The string (in JSON format) is tokenized and pre-validated. Minor
# replacements are made in order to transform the JSON into valid ruby
# input. The transformed string is then evaluated by ruby, which will
# throw an exception on syntactic errors.
#
# PROBLEMS:
# - The "parser" doesn't per se detect something like {"foo": 1,} or
# [1,2,] since this is valid in ruby. I'm not sure about JSON. Anyway, I
# included another "invalid" clause in order to catch these cases of
# which I'm not sure how they are handled properly. If you want the
# parser to be more permissive, remove the first "invalid" clause.
#
# REFERENCES:
# http://json.org
# http://www.ietf.org/rfc/rfc4627.txt

class JSONParser

    RXE = /
        \[|\]|
        \{|\}|
        (?<name_sep>:)|
        (?<invalid>,\s*[}\]])|
        ,|
        (?<string>"([^"\\]++|\\(u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*")|
        -?(0|[1-9]\d*+)(\.\d++)?([Ee][+-]?\d++)?(?=\D|$)|
        true|
        false|
        (?<null>null)|
        [[:space:][:cntrl:]]++|
        (?<invalid>.++)
    /xmu

    def parse(json)
        ruby = json.gsub(RXE) do |t|
            m = $~
            if m['invalid']     then invalid(m['invalid'])
            elsif m['null']     then 'nil'
            elsif m['name_sep'] then '=>'
            elsif m['string']   then m['string'].gsub(/#/, '\\\\#')
            else
                t
            end
        end
        begin
            return eval(ruby)
        rescue Exception => e
            invalid(json)
        end
    end

    def invalid(string)
        raise RuntimeError, 'Invalid JSON: %s' % string
    end

end

if __FILE__ == $0
    a = ARGV.join
    p a
    p JSONParser.new.parse(a)
end
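(On the canonic-output question above: the "j" value is a UTF-16 surrogate
pair, and conflicting results usually come down to whether a parser combines
the pair into one code point. The standard arithmetic, shown here as a
stand-alone sketch rather than code from any submission:)

# "\ud840\udc01" encodes U+20001: take 10 bits from each surrogate
# half and add 0x10000.
hi, lo = 0xD840, 0xDC01
code = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
printf("U+%04X\n", code)   #=> U+20001
puts [code].pack("U")      # the UTF-8 encoded character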
On Feb 1, 2008 11:30 AM, ara howard <ara.t....@gmail.com> wrote:
>
> On Feb 1, 2008, at 9:09 AM, Eric Mahurin wrote:
>
> > Maybe just have a little ruby
> > script that generates a stream of repeatable random (but valid) JSON.
>
> cfp2:~ > cat a.rb
> require 'rubygems'
> require 'json'
>
The benchmark below has no gem dependencies, generates tests with deep
structures, and is self-checking. It found a bug that the original unit
tests didn't. By default, this benchmark keeps increasing the depth until
the runtime is >= 1s.
Also, I made some wrappers for the json and fjson gems to see how they
perform. Here are the results on my machine:
fjson: 224823 chars/second
json(pure): 220289 chars/second
json(ext): 1522250 chars/second
require 'benchmark'

class RandomJSON
  def initialize(value=0)
    @number   = -1
    @string   = -1
    @boolean  = -1
    @constant = -1
    @value    = value-1
  end
  def number
    case (@number=(@number+1)%3)
    when 0 : 0
    when 1 : 1234
    when 2 : 3.75e+1
    end
  end
  def string
    case (@string=(@string+1)%3)
    when 0 : ""
    when 1 : "JSON"
    when 2 : "\"\\\/\b\f\r\t"
    end
  end
  def boolean
    case (@boolean=(@boolean+1)%3)
    when 0 : false
    when 1 : true
    when 2 : nil
    end
  end
  def constant
    case (@constant=(@constant+1)%3)
    when 0 : number
    when 1 : string
    when 2 : boolean
    end
  end
  def array(depth)
    a = []
    depth.times {
      a << value(depth-1)
    }
    a
  end
  def object(depth)
    o = {}
    depth.times {
      o[string] = value(depth-1)
    }
    o
  end
  def value(depth, v=nil)
    case (v||(@value=(@value+1)%3))
    when 0 : array(depth)
    when 1 : object(depth)
    else constant
    end
  end
end

generator = RandomJSON.new((ARGV[1]||0).to_i)
parser = JSONParser.new

Benchmark.bm { |b|
  l = nil; t = nil
  13.times { |depth|
    tree = generator.value(depth, depth%2)
    s = tree.inspect
    #puts s
    s.gsub!(/=>/, ':')
    s.gsub!(/nil/, 'null')
    tree2 = nil
    #puts s
    l = s.length
    t = b.report("#{depth} #{l}") { tree2 = parser.parse(s) }
    raise if tree2!=tree
    break if (t.real>=(ARGV[0]||1).to_f)
  }
  puts "#{(l/t.real).to_i} chars/second"
}
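(Usage, as far as I can read it off the ARGV handling above: load a
JSONParser solution first, then pass the per-depth time limit and the
generator's starting value.)

# Hypothetical invocation, assuming the benchmark was appended to a
# file that also defines JSONParser:
#   ruby bench.rb        # 1 second per depth, generator value 0
#   ruby bench.rb 2 1    # 2 seconds per depth, generator value 1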
On Feb 1, 2008 7:55 AM, Ruby Quiz <ja...@grayproductions.net> wrote:
This doc is missing two things: 1) exactly what is allowed for the top-level
JSON, and 2) where whitespace can appear. This is more complete:
http://www.ietf.org/rfc/rfc4627.txt
def test_keyword_parsing
> assert_equal(true, @parser.parse("true"))
> assert_equal(false, @parser.parse("false"))
> assert_equal(nil, @parser.parse("null"))
> end
>
> def test_number_parsing
> assert_equal(42, @parser.parse("42"))
..
> end
>
> def test_string_parsing
> assert_equal(String.new, @parser.parse(%Q{""}))
..
> end
>
The above isn't legal JSON. An array or object (hash) should be at the
top-level. Surround these by brackets and it will be legal.
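(So, strictly per RFC 4627 — JSON-text = object / array — the keyword and
number tests would have to be wrapped, e.g.:)

# The same assertions restated with legal top-level JSON:
assert_equal([true], @parser.parse("[true]"))
assert_equal([42],   @parser.parse("[42]"))
assert_equal({"n" => nil}, @parser.parse(%Q{{"n": null}}))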
--- quiz155ir0.rb 2008-02-03 18:07:19.186889600 +0100
+++ quiz155i.rb 2008-02-03 18:08:58.008988800 +0100
@@ -48,10 +48,16 @@
end
end
begin
- return eval(ruby)
+ value = eval(ruby)
+ case value
+ when Array
+ return value
+ when Hash
+ return value if value.all? {|k, v| k.kind_of?(String)}
+ end
rescue Exception => e
- invalid(json)
end
+ invalid(json)
end
def invalid(string)
Thanks for the quiz!!
#!/usr/bin/env ruby
# Solution to Ruby Quiz #155 (see http://www.rubyquiz.com/quiz155.html)
# by Paweł Radecki (pawel.j...@gmail.com).

$KCODE = 'UTF-8'
require 'jcode'

class JSONParser
  def parse(input)
    case input
    # TODO: in every case we need to check if the pattern matches the input
    # thoroughly and nothing is left; ex. "[3, 5] Pablos" is still not
    # handled well, it passes through instead of giving an exception
    when '' : raise RuntimeError
    # TODO: There needs to be some smart way of choosing whether we found
    # an object or an array; now an object has priority and it may be
    # found instead of an array
    # object
    when /\{(".+"):(.+)\s*(,\s*(".+"):(.+))+\}/ then
      h = Hash.new
      $&[1...-1].split(/(.*:.*)?\s*,\s*(.*:.*)?/).each do |e|
        a = e.split(/:\s*(\{.*\}\s*)?/)
        h[parse(a.first)] = parse(a.last) unless (a.first.nil? && a.last.nil?)
      end
      h
    when /\{\s*(".+")\s*:\s*(.+)\s*\}/ then { parse($1) => parse($2) }
    when /\{\s*\}/ : Hash.new
    # array
    when /\[.+\]/ then $&[1...-1].split(/(\[.*\])?\s*,\s*(\[.*\])?/).collect{|e| parse(e)}
    when /\[\s*\]/ then []
    # constants
    when /true/ then
      if ($`.strip.empty? && $'.strip.empty?) then true else raise RuntimeError end
    when /false/ then
      if ($`.strip.empty? && $'.strip.empty?) then false else raise RuntimeError end
    when /null/ then
      if ($`.strip.empty? && $'.strip.empty?) then nil else raise RuntimeError end
    # string
    when /"([A-Za-z]|(\s)|(\\")|(\\\\)|(\\\/)|(\\b)|(\\f)|(\\n)|(\\r)|(\\t)|(\\u[0-9a-fA-F]{4,4}))+"/ then
      $&[1...-1].gsub(/\\"/, '"').gsub(/\\n/, "\n").gsub(/\\u([0-9a-fA-F]{4,4})/u){ ["#$1".hex].pack('U*') }
    when /""/ then ""
    # number
    when /-?(0|([1-9][0-9]*))(\.[0-9]+)?([e|E][+|-]?[0-9]+)?/ then
      if ($`.strip.empty? && $'.strip.empty?) then eval($&) else raise RuntimeError end
    else
      raise RuntimeError
    end
  end
end

#puts JSONParser.new.parse(ARGV.first)
--
Paweł Radecki
e: pawel.j...@gmail.com
w: http://radeckimarch.blogspot.com/
On Feb 3, 2008 8:14 AM, steve <oks...@yahoo.com> wrote:
> Hey guys
>
> This is my first parser. I used Nathan Sobo's Treetop parsing library (
> http://treetop.rubyforge.org/, gem install treetop):
>
> http://pastie.caboo.se/146906
>
> require 'treetop'
>
>
> File.open("json.treetop", "w") {|f| f.write GRAMMAR }
>
>
> Treetop.load "json"
>
> parser = JsonParser.new
>
>
> pp parser.parse(STDIN.read).value if $0 == __FILE__
>
>
>
> BEGIN {
>
> GRAMMAR = %q{
>
> grammar Json
>
..
Steve,
I tried this parser, but couldn't get it to work with the String interface
of the quiz testbench. Using an IO/stream is how a parser should work, but
you should also be able to wrap a String in a StringIO. It complained about
StringIO#index not being defined, but IO#index isn't defined either. I
couldn't find anyplace where IO#index was monkey-patched.

Eric
Not a packrat parser. If you have memory to memoize each grammar rule
attempted at each byte-offset of the input, you have memory to have all
the input in memory at once, and it's more efficient to do so.
Obviously a simple adapter can fix that, but steve's doesn't, and
neither does mine (emailed to JEG before the 24 hours - I'll post
it soon). The adapter is needed because the parse() method expected
is different from the one that Treetop's parsers provide.
Clifford Heath.
I was expecting this to work as a proxy to the Treetop parser:
class JSONParser
  def initialize
    @parser = JsonParser.new
  end
  def parse(s)
    @parser.parse(StringIO.new(s)).value
  end
end
You need to pass the string directly, not as a StringIO.
Treetop parses a String, not an IO.
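(With that correction — pass the String directly — the adapter would
presumably look like this; compare tho_mica_l's wrapper later in the
thread, which also raises on a nil result:)

class JSONParser
  def initialize
    @parser = JsonParser.new
  end
  def parse(s)
    result = @parser.parse(s)   # Treetop takes the String itself
    raise RuntimeError, "parse failed" unless result
    result.value
  end
end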
On Feb 1, 2008 7:55 AM, Ruby Quiz <ja...@grayproductions.net> wrote:
> In honor of that, this week's Ruby Quiz is to write a parser for JSON.
Here is another solution that uses my (somewhat) optimized 'grammar' package
directly. The JSON grammar is similar to before. The next release will use
an API closer to the simple Grammar0 package.
Also, I ported this to my development code (which is not checked in to CVS
yet), to see the performance. I grabbed all of the current submissions
along with the fjson and json gems to see how they compare in terms of
performance. I used the previous benchmark that I posted. It also revealed
bugs in these submissions. Here is the performance I found on my machine
with ruby 1.8.6:
ch/s author/gem
---- ----------
- oksteev (TreeTop, couldn't get it to parse a string)
- Pawel Radecki (RE, mismatch)
- Justin Ethier (RE lexer + parser, 71: Invalid number 0)
4054 Eric Mahurin (Grammar0, no lexer, no parser generation)
54586 Eric Mahurin (Grammar, no lexer, v0.5)
166041 Thomas Link (RE, ruby 1.9 results)
220289 json
223486 Eric Mahurin (Grammar, no lexer, unreleased)
224823 fjson (uses C extensions)
553081 Eric Mahurin (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250 json (w/ C extensions)
Note that the Grammar variants don't have the advantage of using RegExp
where you get some C performance. But, in my dev code, I'm using ruby2cext
to get a little more performance. You could integrate a RegExp lexer with a
Grammar parser, also.
The nicest thing about using Treetop is how close the
grammar to the JSON spec :-).
I prefer to convert hash keys to symbols, but the test
cases don't allow that so I stripped out my .to_sym's.
Note that the test cases are rather limited in things
like white-space handling (and in fact the JSON spec is
actually incorrect, in that it doesn't define which rules
constitute tokens that may be separated by whitespace!)
Whitespace in Treetop must be handled explicitly, and it's
easy to miss a spot where it should be skipped, so the
tests should cover that.
I welched on full Unicode support as commented in my code,
but there's another test case you should apply, to parse
the string "\\u1234", which should throw an exception.
You'll see that my code is missing that exception, and
will misbehave instead :-).
It wasn't clear from the quiz or the JSON spec whether an
integer is valid JSON. I elected to accept any value, not
just an object or array.
Treetop now uses Polyglot, which loads the generated .rb
file if you've generated it, or the .treetop file if not.
Clifford Heath.
First, the interactive test program:
require 'treetop'
require 'json'    # Note that we can require the Treetop file directly.
require 'readline'

parser = JsonParser.new
while line = Readline::readline("? ", [])
  begin
    tree = parser.parse(line)
    if tree
      p tree.obj
    else
      puts parser.failure_reason
    end
  rescue => e
    puts e
    p e.backtrace
    p tree if tree
  end
end
puts
puts
Now, my test adapter:
class JSONParser
  def parse(text)
    parser = JsonParser.new
    p = parser.parse(text)
    raise parser.failure_reason unless p
    p.obj
  end
end
Finally, the grammar itself:
# Treetop grammar for JSON for Ruby Quiz #155 by Clifford Heath.
grammar Json
  rule json
    value
  end

  rule object
    '{' s pairs:pairs? s '}' s
    { def obj
        pairs.empty? ? {} : pairs.obj
      end
    }
  end

  rule pairs
    member rest:(s ',' s member)*
    { def obj
        rest.elements.inject({eval(member.k.text_value) => member.value.obj}) { |h, e|
          h[eval(e.member.k.text_value)] = e.member.value.obj
          h
        }
      end
    }
  end

  rule member # key/value pair of an object
    k:string s ':' s value
  end

  rule array
    '[' s e:elements? s ']'
    { def obj
        e.empty? ? [] : e.obj
      end
    }
  end

  rule elements # elements of an array
    value rest:(s ',' s value)*
    { def obj
        rest.elements.inject([value.obj]) { |a, e|
          a << e.value.obj
        }
      end
    }
  end

  rule value
    s alt:(string / number / object / array
          / 'true'  { def obj; true;  end }
          / 'false' { def obj; false; end }
          / 'null'  { def obj; nil;   end }
          )
    { def obj; alt.obj; end }
  end

  rule string
    '"' char* '"' { def obj
        eval(
          # Strip Unicode characters down to the chr equivalent.
          # Note that I'm cheating here: '"\\u4321"' should assert,
          # and there are cases that will succeed but corrupt the data.
          # This should be handled in the "char" rule.
          text_value.gsub(/\\u..../) { |unicode|
            eval("0x"+unicode[2..-1]).chr
          }
        )
      end
    }
  end

  rule char
    '\\' [\"\\\/bfnrt]
    / '\\u' hex hex hex hex
    / (![\\"] .)
  end

  rule hex
    [0-9A-Fa-f]
  end

  rule number
    int frac? exp? { def obj; eval(text_value); end }
  end

  rule int # Any integer
    '-'? ([1-9] [0-9]* / '0')
    { def obj; eval(text_value); end }
  end

  rule frac # The fractional part of a floating-point number
    '.' [0-9]+
  end

  rule exp # An exponent
    [eE] [-+]? [0-9]+
  end

  rule s # Any amount of whitespace
    [ \t\n\r]*
  end
end
I guess my eyes missed the ".read" after the STDIN. The "STDIN" jumped out
and I thought it took an IO-like object. I was able to just remove a couple
of lines of Steve's code and rename the class to get it to work with the
tests (unit and random performance benchmark). It mismatches in both the
unit tests and the performance benchmark, but if I remove the self-checking
I was able to run the performance benchmark. It was only about 6800
chars/second, which is near the bottom of the heap. Not sure how valid that
is since the parser has problems though.

Eric
When I benchmarked my code, I got a parsing performance of
around 10,000 chars/second, much lower than many of the other
solutions.
The two components of the solution are below -- a small Ruby class
(that acts as an interface b/w James' unit testing and what Treetop
produces) and a Treetop grammar file.
Eric
====
Here's the small Ruby class:
# A solution to RubyQuiz #155.
#
# Takes a JSON string and parses it into an equivalent Ruby value.
#
# See http://www.rubyquiz.com/quiz155.html for details.
#
# The latest version of this solution can also be found at
# http://learnruby.com/examples/ruby-quiz-155.shtml .
require 'rubygems'
require 'treetop'
Treetop.load 'json'
class JSONParser
  def initialize
    @parser = JSONHelperParser.new
  end

  def parse(input)
    result = @parser.parse(input)
    raise "could not parse" if result.nil?
    result.resolve
  end
end
====
And here's the Treetop grammar:
# Treetop grammar for JSON. Note: it doesn't handle Unicode very well.
grammar JSONHelper
rule value
spaces v:(simple_literal / string / number / array / object)
spaces {
def resolve
v.resolve
end
}
end
rule spaces
[ \t\n\r]*
end
# SIMPLE LITERALS
rule simple_literal
("true" / "false" / "null") {
def resolve
case text_value
when "true" : true
when "false" : false
when "null" : nil
end
end
}
end
# NUMBERS
rule number
integer fractional:fractional? exponent:exponent? {
def resolve
if fractional.text_value.empty? && exponent.text_value.empty?
integer.text_value.to_i
else
text_value.to_f
end
end
}
end
rule integer
"-"? [1-9] digits
/
# single digit
"-"? [0-9]
end
rule fractional
"." digits
end
rule exponent
[eE] [-+]? digits
end
rule digits
[0-9]*
end
# STRINGS
rule string
"\"\"" {
def resolve
""
end
}
/
"\"" characters:character* "\"" {
def resolve
characters.elements.map { |c| c.resolve }.join
end
}
end
rule character
# regular characters
(!"\"" !"\\" .)+ {
def resolve
text_value
end
}
/
# escaped: \\, \", and \/
"\\" char:("\"" / "\\" / "/") {
def resolve
char.text_value
end
}
/
# escaped: \b, \f, \n, \r, and \t
"\\" char:[bfnrt] {
def resolve
case char.text_value
when 'b' : "\b"
when 'f' : "\f"
when 'n' : "\n"
when 'r' : "\r"
when 't' : "\t"
end
end
}
/
# for Unicode that overlay ASCII values, we just use the ASCII value
"\\u00" digits:(hex_digit hex_digit) {
def resolve
str = " "
str[0] = digits.text_value.hex
str
end
}
/
# for all other Unicode values use the null character
"\\u" digits1:(hex_digit hex_digit) digits2:(hex_digit hex_digit)
{
def resolve
str = " "
str[0] = digits1.textvalue.hex
str[1] = digits2.textvalue.hex
str
end
}
end
rule hex_digit
[0-9a-fA-F]
end
# ARRAYS
rule array
"[" spaces "]" {
def resolve
Array.new
end
}
/
"[" spaces value_list spaces "]" {
def resolve
value_list.resolve
end
}
end
rule value_list
value !(spaces ",") {
def resolve
[ value.resolve ]
end
}
/
value spaces "," spaces value_list {
def resolve
value_list.resolve.unshift(value.resolve)
end
}
end
# OBJECTS
rule object
"{" spaces "}" {
def resolve
Hash.new
end
}
/
"{" spaces pair_list spaces "}" {
def resolve
pair_list.resolve
end
}
end
rule pair_list
pair !(spaces ",") {
def resolve
{ pair.resolve[0] => pair.resolve[1] }
end
}
/
pair spaces "," spaces pair_list {
def resolve
hash = pair_list.resolve
hash[pair.resolve[0]] = pair.resolve[1]
hash
end
}
end
rule pair
string spaces ":" spaces value {
def resolve
[ string.resolve, value.resolve ]
end
}
end
end
I'm not surprised. Treetop has been designed to be super-clean and pure,
not fast. It'll get faster. Myself, I think it should have real Regex's at
the leaves, and that'd make a *heap* of difference.
Also, MRI is glacially slow at creating objects, which Treetop does a lot
of (in this case, for every char in a string for example). If it used Structs,
or if it used Rubinius, things would be a lot quicker.
One other addition we're considering is to add skip rules, which don't build
nodes, for things like white-space and punctuation, which would basically
double the speed.
Consider also that the backtracking allows you to handle grammars that are
simply impossible with traditional techniques... like I'm doing with CQL
in ActiveFacts.
Clifford Heath.
> I guess my eyes missed the ".read" after the STDIN . The "STDIN" jumped
> out
> and I thought it took an IO-like object. I was able to just remove a
> couple
> lines of Steve's code and rename the class to get it to work with the
> tests
> (unit and random performance benchmark). It mismatches in both the unit
> tests and the performance benchmark, but I if I remove the self-checking I
> was able to run the performance benchmark. It was only about 6800
> chars/second which is near the bottom of the heap. Not sure how valid
> that
> is since the parser has problems though.
>
Hey Eric,
What do you mean it mismatches? Also what do you mean by self-checking?
It passes all the unit tests for me.
I just found out why you probably had problems with my solution on the
benchmark. You need to call .value on the object returned by the parser to
get the actual parsed json value.
t = b.report("#{depth} #{l}") { tree2 = parser.parse(s) }
-->
t = b.report("#{depth} #{l}") { tree2 = parser.parse(s).value }
- steve
Here are some extra unit tests I wrote. Let me know if you think any of
them are incorrect (per the spec):
def test_more_numbers
  assert_equal(5, @parser.parse("5"))
  assert_equal(-5, @parser.parse("-5"))
  assert_equal 45.33, @parser.parse("45.33")
  assert_equal 0.33, @parser.parse("0.33")
  assert_equal 0.0, @parser.parse("0.0")
  assert_equal 0, @parser.parse("0")
  assert_raises(RuntimeError) { @parser.parse("-5.-4") }
  assert_raises(RuntimeError) { @parser.parse("01234") }
  assert_equal(0.2e1, @parser.parse("0.2E1"))
  assert_equal(42e10, @parser.parse("42E10"))
end

def test_more_string
  assert_equal("abc\befg", @parser.parse(%Q{"abc\\befg"}))
  assert_equal("abc\nefg", @parser.parse(%Q{"abc\\nefg"}))
  assert_equal("abc\refg", @parser.parse(%Q{"abc\\refg"}))
  assert_equal("abc\fefg", @parser.parse(%Q{"abc\\fefg"}))
  assert_equal("abc\tefg", @parser.parse(%Q{"abc\\tefg"}))
  assert_equal("abc\\efg", @parser.parse(%Q{"abc\\\\efg"}))
  assert_equal("abc/efg", @parser.parse(%Q{"abc\\/efg"}))
end

def test_more_object_parsing
  assert_equal({'a'=>2,'b'=>4}, @parser.parse(%Q{{ "a" : 2 , "b":4 }}))
  assert_raises(RuntimeError) { @parser.parse(%Q{{ "a" : 2, }}) }
  assert_raises(RuntimeError) { @parser.parse(%Q{[ "a" , 2, ]}) }
end
Thanks Steve,
That was it. I didn't make a wrapper class like I did for the other Treetop
parsers. Your parser works without problems now.
Also, I grabbed Eric I's and Clifford Heath's Treetop parsers and
benchmarked them. Here's what I found:
ch/s author/gem
---- ----------
- Pawel Radecki (RE, mismatch)
3214 Justin Ethier (RE lexer + parser, fixed number parsing)
4054 Eric Mahurin (Grammar0, no lexer, no parser generation)
4078 Eric I (Treetop, unicode broken)
6534 oksteev (Treetop, mismatches in benchmark)
8313 Clifford Heath (Treetop, had to remove handling of "\/")
On Feb 3, 2008 7:44 PM, Clifford Heath <n...@spam.please.net> wrote:
> Eric Mahurin wrote:
> > It was only about 6800
> > chars/second which is near the bottom of the heap. Not sure how valid
> that
> > is since the parser has problems though.
>
> I'm not surprised. Treetop has been designed to be super-clean and pure,
> not fast. It'll get faster. Myself, I think it should have real Regex's at
> the leaves, and that'd make a *heap* of difference.
I'm biased, but I don't think Treetop is super-clean and pure. Take a look
at my rubyforge grammar package if you want something really clean and pure:
* specify language grammars in a BNF-like ruby DSL (no new grammar language
to deal with!)
- integration of lexer/parser with other code is seamless since everything
is ruby
- (sub)grammars are objects and you combine them using operators and
methods
* complete API can be described in only a little more than a hundred lines
of code (but the implementation is a lot more since it has lots of
optimizations)
* heavily duck-typed such that a lexer or parser (or preprocessor, or parser
w/o lexer) can be specified by the same DSL
* infinite backtracking (not by default because of performance and
functionality, you specify what needs it)
* can make any kind of pipeline of lexers/parsers and multi-thread them.
* even with its pure ruby/no-regexp lexers/parsers, it can go head-to-head
against Regexp-based stuff. The flexibility of Regexp is limited, so I
don't see the point since 'grammar' gets enough performance.
* don't need to read the file into a string to parse it
* and more
> Also, MRI is glacially slow at creating objects, which Treetop does a lot
> of (in this case, for every char in a string for example). If it used
> Structs,
> or if it used Rubinius, things would be a lot quicker.
>
> One other addition we're considering is to add skip rules, which don't
> build
> nodes, for things like white-space and punctuation, which would basically
> double the speed.
With 'grammar' you have complete control of the parsing result. Without
actions, the output stream is a copy of the input stream. You can group
things and discard things efficiently. One extreme would be to interpret
the input on the fly and not build anything. I have an example tcl
interpreter done in a couple hundred lines of code using 'grammar', that
produces no AST - it just interprets while parsing.
First up, it sounds like your package is a significant achievement,
and I don't want anything here to imply I'm not commending you on
it. But different approaches suit different goals...
> but I don't think Treetop is super-clean and pure. Take a look
> at my rubyforge grammar package if you want something really clean and pure:
:-) To me, to use a separate grammar is *more pure* than to use
a Ruby DSL. I can see how you would argue the opposite however...
Your example parsers are much harder to read than Treetop however,
compare:
space = Grammar::Element[Set[?\ ,?\t].duck!(:==,:include?)].discard
with
skip space
[ \t]
end
(ok, so TT doesn't have a skip keyword yet, soon...)
That's what I mean by clean and pure anyhow. Not pure Ruby,
but pure grammar.
Nathan's concept is that "grammar" should become a Ruby keyword,
and should introduce the grammar meta-grammar, rather than using
existing Ruby syntax. I think that polyglot approach is much better
than the DSL approach, since the Ruby grammar isn't well-suited to
many important languages.
The point is, if/when Ruby uses a parser whose grammar is composable
with a DSL grammar you define, the result is truly seamless and
the syntax is unrestrained. Existing DSL syntax imposes too many
restraints. A language with a composable grammar would be truly
excellent!
> * specify language grammars in a BNF-like ruby DSL (no new grammar language
> to deal with!)
I think I've argued why that is a bad thing :-), though it definitely
has good things about it also - like your filters etc...
Ruby DSL is far too awkward to express CQL - which is my primary
project at present.
> - integration of lexer/parser with other code is seamless since everything
> is ruby
Nice, but it won't help me when I need to target C# or Java.
Most important grammars need to be compilable to more than
one target language. Treetop's not there yet, but it's easy
to see how to do it, whereas it won't ever happen with yours
AFAICS.
> - (sub)grammars are objects and you combine them using operators and
> methods
Can't get much easier than just "include otherGrammar", which
works in Treetop (mix the generated modules together).
> * complete API can be described in only a little more than a hundred lines
Similarly to Treetop, AFAICS.
> * infinite backtracking (not by default because of performance and
> functionality, you specify what needs it)
Does this memoize, or give exponential performance? Or is it optional?
> * can make any kind of pipeline of lexers/parsers and multi-thread them.
> * even with its pure ruby/no-regexp lexers/parsers, it can go head-to-head
> against Regexp-based stuff. The flexibility of Regexp is limited, so I
> don't see the point since 'grammar' gets enough performance.
> * don't need to read the file into a string to parse it
> * and more
Very cool stuff.
> With 'grammar' you have complete control of the parsing result. Without
> actions, the output stream is a copy of the input stream. You can group
> things and discard things efficiently. One extreme would be to interpret
> the input on the fly and not build anything.
Yes, this is excellent, and something I've always wanted from a parser
generator. You should add your json parsers to the examples in the gem!
Clifford Heath.
Surely you can sugar that up a bit, something like
space = Grammar.charset(" \t").discard
> with
> skip space
> [ \t]
> end
--
vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407
Well, maybe. I just grabbed the first incomprehensible line from
Eric's Tcl example grammar, in order to point out a difference in
our meanings for "clean and pure" :-).
Clifford Heath.
I think he meant that it doesn't throw exceptions on error. This is the
wrapper I used to make it conform with the provided tests:
class JSONParser
  def initialize
    @parser = JsonParser.new
  end
  def parse(text)
    rv = @parser.parse(text)
    if rv
      return rv.value
    else
      raise RuntimeError
    end
  end
end
Regards,
Thomas.
I was slightly surprised to find this in a full-blown parser. I think
this shortcut is a really bad choice here since it makes the parser
execute code on input like ["#{`ls -r /`}"]. This should definitely be
handled in the char rule. Justin Ethier's solution suffers from the
same problem, but he included a comment on this problem in his code.
Here are some random test cases to check this:
assert_raise(RuntimeError) { @parser.parse(%{p "Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{[], p "Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{[p "Busted"]}) }
assert_raise(RuntimeError) { @parser.parse(%{{1 =>
STDOUT.puts("Busted")}}) }
assert_raise(RuntimeError) { @parser.parse(%{"\u0022; p 123;
\u0022Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{"" p 123; ""}) }
assert_equal("\u0022; p 123; \u0022Busted",
@parser.parse(%{"\\u0022; p 123; \\u0022Busted"}))
assert_equal('#{p 123}', @parser.parse(%q{"#{p 123}"}))
assert_equal('["#{`ls -r`}"]', @parser.parse(%q{["#{`ls -r`}"]}))
assert_equal('#{p 123}', @parser.parse(%q{"\\u0023{p 123}"}))
assert_equal('#{p 123}', @parser.parse(%q{"\u0023{p 123}"}))
Regards,
Thomas.
People should really test their test cases. At least I should test mine
before sending them to the list. This line should be:
Using Eric's benchmark I get 36kb/sec, but I haven't benchmarked any
other solution.
Paolo
If you check my version, you'll see that you can raise a meaningful
RuntimeError by raising @parser.failure_reason - this produces an
error message that says how far the parse progressed, and which tokens
might have allowed it to progress further.
Clifford Heath.
At least I *admitted* to cheating :-).
For what it's worth, I get 144kb/sec from tho_mica_l's solution, after
converting it to Ruby 1.8 like this:
class JSONParser

    RXE = /
        \[|\]|
        \{|\}|
        (:)|
        (,\s*[}\]])|
        ,|
        ("(?>[^"\\]+|\\(?:u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*")|
        -?(?=\d)(?>0|[1-9]\d*)(?>\.\d+)?(?>[Ee][+-]?\d+)?(?=\D|$)|
        true|
        false|
        (null)|
        (?>[[:space:][:cntrl:]]+)|
        ((?>.+))
    /xmu

    def parse(json)
        ruby = json.gsub(RXE) do |t|
            if !$5.nil? || !$2.nil? then invalid($5.nil? ? $2 : $5)
            elsif !$4.nil?          then 'nil'
            elsif !$1.nil?          then '=>'
            elsif !$3.nil?          then $3.gsub(/#/, '\\\\#')
            else
                t
            end
        end
        begin
            return eval(ruby)
        rescue Exception => e
            invalid(json)
        end
    end

    def invalid(string)
        raise RuntimeError, 'Invalid JSON: %s' % string
    end

end
data = YAML.load(json)
T.
Interesting suggestion, but ruby seems to disagree:
irb(main):006:0> YAML.load_file 'test.json'
ArgumentError: syntax error on line 0, col 99: `{"a":2,"b":
3.141,"TIME":"2007-03-14T11:52:40","c":"c","d":[1,"b",3.14],"COUNT":
666,"e":{"foo":"bar"},"foo":"B\u00e4r","g":"\u677e\u672c\u884c
\u5f18","h":1000.0,"bar":"\u00a9 \u2260 \u20ac!","i":
0.001,"j":"\ud840\udc01"}'
Thank you. Now we can compare apples to apples with this on 1.8.6. Here
are the results with these two new benchmarks on my machine:
ch/s author/gem
---- ----------
- Pawel Radecki (RE, mismatch)
3214 Justin Ethier (RE lexer + ruby eval, fixed number parsing)
4054 Eric Mahurin (Grammar0, no lexer, no parser generation)
4078 Eric I (Treetop, unicode broken)
6534 oksteev (Treetop, mismatches in benchmark)
8313 Clifford Heath (Treetop, had to remove handling of "\/")
54586 Eric Mahurin (Grammar, no lexer, v0.5)
137989 Paolo Bonzini (RE)
166041 Thomas Link (RE lexer + ruby eval, ruby 1.9 results)
220289 json
223486 Eric Mahurin (Grammar, no lexer, unreleased)
224823 fjson (uses C extensions)
333368 Thomas Link & Paolo Bonzini (RE + eval, unicode broken)
553081 Eric Mahurin (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250 json (w/ C extensions)
Looks like the above solution is the fastest pure-ruby solution out there.
It's interesting that the performance is better in 1.8.6 than 1.9. The
"eval" solution is a nice trick, as it uses ruby's C parser to act as the
parser. But this isn't viable for most other languages you might parse.
Maybe it's fine for this JSON case, though. The "eval" does seem a bit
dangerous, though.
On Feb 3, 2008 11:45 PM, Clifford Heath <n...@spam.please.net> wrote:
> Eric Mahurin wrote:
> > I'm biased,
>
> First up, it sounds like your package is a significant achievement,
> and I don't want anything here to imply I'm not commending you on
> it. But different approaches suit different goals...
Thanks!
> but I don't think Treetop is super-clean and pure. Take a look
> > at my rubyforge grammar package if you want something really clean and
> pure:
>
> :-) To me, to use a separate grammar is *more pure* than to use
> a Ruby DSL. I can see how you would argue the opposite however...
> Your example parsers are much harder to read than Treetop however,
> compare:
> space = Grammar::Element[Set[?\ ,?\t].duck!(:==,:include?)].discard
> with
> skip space
> [ \t]
> end
>
I agree. It is ugly. The ugliness can be easily fixed though:
1. when making a parser, derive from Grammar so that namespace is available
2. E = Element, so that you can have a shorthand for the most common atomic
Grammar
3. Performance-wise, I found that using a Set for matching was slower than
just having a few (ranged) alternatives.
4. Grammar::Element.[] is simply an alias for Grammar::Element.new. My plan
is that in the next release, I'll make all the classes "callable" by
defining methods of their name that call new. This is better when .new can
take a block (like Recurse).
5. I made a bad choice of #== for the matching method for the pattern
argument. #=== would be better and is what I'm using in my dev code (and
the Grammar0 I posted). This would allow it to work out of the box with
Ranges too. So, you could do this: alpha = E(?a..?z) | E(?A..?Z)
So, doing 1..3, in the released grammar v0.5, you'd have this instead:
space = (E[?\s] | E[?\t]).discard
> (ok, so TT doesn't have a skip keyword yet, soon...)
> That's what I mean by clean and pure anyhow. Not pure Ruby,
> but pure grammar.
>
> Nathan's concept is that "grammar" should become a Ruby keyword,
> and should introduce the grammar meta-grammar, rather than using
> existing Ruby syntax. I think that polyglot approach is much better
> than the DSL approach, since the Ruby grammar isn't well-suited to
> many important languages.
I don't think ruby needs any new keywords. It already has more than it
needs in my opinion. There is enough power with just classes, methods,
blocks, and operator overloading.
> The point is, if/when Ruby uses a parser whose grammar is composable
> with a DSL grammar you define, the result is truly seamless and
> the syntax is unrestrained. Existing DSL syntax imposes too many
> restraints. A language with a composable grammar would be truly
> excellent!
>
> > * specify language grammars in a BNF-like ruby DSL (no new grammar
> language
> > to deal with!)
>
> I think I've argued why that is a bad thing :-), though it definitely
> has good things about it also - like your filters etc...
>
> Ruby DSL is far too awkward to express CQL - which is my primary
> project at present.
Don't know anything about CQL to answer. I would like to make what I have
general enough.
> - integration of lexer/parser with other code is seamless since
> everything
> > is ruby
>
> Nice, but it won't help me when I need to target C# or Java.
> Most important grammars need to be compilable to more than
> one target language. Treetop's not there yet, but it's easy
> to see how to do it, whereas it won't ever happen with yours
> AFAICS.
It definitely could happen with mine. I do ruby code generation now, but
could have another Grammar-like class (or another mechanism) to generate
code for another language. I would have to add a new method for dumping out
the generated code (right now it is only eval'ed immediately). In my dev
code, I'm also using ruby2cext to compile it on the fly. I'll try to have a
general language infrastructure in the next release.
> - (sub)grammars are objects and you combine them using operators and
> > methods
>
> Can't get much easier than just "include otherGrammar", which
> works in Treetop (mix the generated modules together).
>
> > * complete API can be described in only a little more than a hundred
> lines
>
> Similarly to Treetop, AFAICS.
>
> > * infinite backtracking (not by default because of performance and
> > functionality, you specify what needs it)
>
> Does this memoize, or give exponential performance? Or is it optional?
In general, the parsers it generates are LL(1). It makes decisions one
character/token at a time. But, if you use the #lookahead method on any
grammar, it will give a new grammar that handles the case when the grammar
fails in the middle (backtracks to the position where the grammar started).
For most languages I've seen this is rarely needed, but it is in your back
pocket to be used when needed. I haven't done anything to try to optimize
it as it isn't needed that often.
> * can make any kind of pipeline of lexers/parsers and multi-thread them.
> > * even with its pure ruby/no-regexp lexers/parsers, it can go
> head-to-head
> > against Regexp-based stuff. The flexibility of Regexp is limited, so I
> > don't see the point since 'grammar' gets enough performance.
> > * don't need to read the file into a string to parse it
> > * and more
>
> Very cool stuff.
>
> > With 'grammar' you have complete control of the parsing result.
> Without
> > actions, the output stream is a copy of the input stream. You can
> group
> > things and discard things efficiently. One extreme would be to
> interpret
> > the input on the fly and not build anything.
>
> Yes, this is excellent, and something I've always wanted from a parser
> generator. You should add your json parsers to the examples in the gem!
Yes, next release.
Especially if you take it a bit to the edge like this (plug into the
other solution):
    RXE = /
        [\[\]\{\}[:space:][:cntrl:]truefals]+|
        (:)|
        (?:,(?>\s*)(?![}\]]))|
        ("(?>[^"\\]+|\\(?:u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*")|
        -?(?=\d)(?>0|[1-9]\d*)(?>\.\d+)?(?>[Ee][+-]?\d+)?(?=\D|$)|
        (null)|
        (.+)
    /xmu

    def parse(json)
        ruby = json.gsub(RXE) do |t|
            if    !$4.nil? then invalid($4)
            elsif !$3.nil? then 'nil'
            elsif !$1.nil? then '=>'
            elsif !$2.nil? then $2.gsub(/#/, '\\\\#')
            else
                t
            end
        end
        begin
            return eval(ruby)
        rescue Exception => e
            invalid(json)
        end
    end
This works because the "null" token does not *start* with any letter
in "true" or "false". "fnull" would be happily converted to "fnil",
but eval catches that luckily.
As ugly as it is, it cranks in 200kb/sec on my machine, which should
be more than 400kb/sec on yours. 25-30% of C speed is quite
impressive (I had to say this since I was bitching about Ruby's speed
last week...).
Paolo
> On Feb 1, 2008 7:55 AM, Ruby Quiz <ja...@grayproductions.net> wrote:
>
>> http://json.org/
>
>
> This doc is missing two things: 1) exactly what is allowed for the
> top-level json,
I guess it does hint at this when it says:
"JSON is built on two structures:"
and goes on to describe object and array. I just didn't interpret it
well.
Thanks for clarifying.
> and 2) where can whitespace appear.
Good point.
> This is more complete:
>
> http://www.ietf.org/rfc/rfc4627.txt
Good resource. Thanks for making me aware of it.
James Edward Gray II
I'm not sure I understand your argument. f will always be considered
illegal. It will never reach the eval(). Only text matching a clause not
defined illegal in the rx gets fed to eval(). You could also wrap the
atoms between \b (eg \bnull\b) but it seemed unnecessary. Show me an
example of malicious code that passes through.
Well, I missed the changes in the regexp and thus missed your point.
The differences between ruby18 and ruby19 are quite interesting
though. Can somebody with deeper knowledge of ruby19 affirm that the
"copy on write" strategy is still in use?
Definitely an improvement!
>> Nathan's concept is that "grammar" should become a Ruby keyword,
> I don't think ruby needs any new keywords. It already has more than it
> needs in my opinion. There is enough power with just classes, methods,
> blocks, and operator overloading.
Part of the point of using a packrat-style parser is that there are
no true keywords, meaning words that must only be used in their KW
places. Many of Ruby's words are like that, of course. But my point
was that once you say "grammar", you're now talking a different
language, which can use as much and *only* as much of the Ruby
grammar as it needs. And when inside a rule you say {, you're back
in Ruby grammar... etc.
The normal lexer/parser split makes that fairly hard to do, as the
lexer needs to know whether to return the KW meaning or a general
identifier.
>> Ruby DSL is far too awkward to express CQL - which is my primary
>> project at present.
> Don't know anything about CQL to answer. I would like to make what I have
> general enough.
Take a look at the ORM diagrams under the images directory, and
then at the CQL version of the same models, under
<http://activefacts.rubyforge.org/svn/examples/>. You'll see a lot
of what looks like unparsable plain English... but it's not. In
fact, the language is much harder to parse than any of those examples,
I don't have any published examples of queries (these are just
declarations). The CQL parser is in
<http://activefacts.rubyforge.org/svn/lib/activefacts/cql/> - you
might be able to see where it's hard. You can be many tokens into
recognising a fact_clause before realising you're looking at a
"condition".
It's possible that with your lookahead it would be possible though.
>> to see how to do it, whereas it won't ever happen with yours AFAICS.
> It definitely could happen with mine. I do ruby code generation now, but
> could have another Grammar-like class
Cool! I didn't realize you were generating code from this.
> But, if you use the #lookahead method on any
> grammar, it will give a new grammar that handles the case when the grammar
> fails in the middle (backtracks to the position where the grammar started).
You mean it backtracks to the start of the file? Or to where you called
lookahead?
Clifford Heath.
On Feb 4, 2008 12:14 PM, Clifford Heath <n...@spam.please.net> wrote:
> Eric Mahurin wrote:
> >> compare:
> >> space = Grammar::Element[Set[?\ ,?\t].duck!(:==,:include?)].discard
> > So, doing 1..3, in the released grammar v0.5, you'd have this instead:
> > space = (E[?\s] | E[?\t]).discard
>
> Definitely an improvement!
>
> >> Nathan's concept is that "grammar" should become a Ruby keyword,
> > I don't think ruby needs any new keywords. It already has more than it
> > needs in my opinion. There is enough power with just classes, methods,
> > blocks, and operator overloading.
>
> Part of the point of using a packrat-style parser is that there are
> no true keywords, meaning words that must only be used in their KW
> places. Many of Ruby's words are like that, of course. But my point
> was that once you say "grammar", you're now talking a different
> language, which can use as much and *only* as much of the Ruby
> grammar as it needs. And when inside a rule you say {, you're back
> in Ruby grammar... etc.
>
I definitely need to go learn about the packrat/PEG stuff. It sounds
interesting after a look at Wikipedia. I still don't really understand
LR/LALR parsers. My main focus has been LL/recursive-descent, and PEG
appears to be recursive-descent based.
> The normal lexer/parser split makes that fairly hard to do, as the
> lexer needs to know whether to return the KW meaning or a general
> identifier.
Fortunately in my grammar package the lexer and parser can be specified in
exactly the same way. Also, you can use anything for tokens. If you are
using a pattern to match them, it will just use the expression
(pattern===next_token) to see if it is a match. So, in the lexer, you might
generate a String for any identifier/keyword. The lexer wouldn't care
whether it is a keyword or not. In the parser, you then could use this to
match an arbitrary identifier:
E(String) # (String===next_token) is true if next_token is a String
or if you wanted to match a keyword, you'd have this type of Grammar:
E("begin") # ("begin"===next_token) is true if next_token=="begin"
You could also have some arbitrary matching function by defining your own
pattern class (with a #== (v0.5) or #=== (dev) method).
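For instance, the dispatch might be exercised like the following sketch.
The token values and the loop are my own illustration; only the
(pattern===next_token) convention comes from the description above:

# Hypothetical token stream from a lexer: keywords and identifiers
# are both plain Strings, numbers are Fixnums.
tokens = ["begin", "x", 42, "end"]

# A grammar element only needs something that responds to #=== :
matchers = [
  "begin",   # "begin" === tok : matches only that keyword
  String,    # String  === tok : matches any identifier/keyword token
  Integer,   # Integer === tok : matches any integer token
  "end"
]

matchers.zip(tokens).each do |pattern, token|
  raise "mismatch: #{token.inspect}" unless pattern === token
end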
>> Ruby DSL is far too awkward to express CQL - which is my primary
> >> project at present.
> > Don't know anything about CQL to answer. I would like to make what I
> have
> > general enough.
>
> Take a look at the ORM diagrams under the images directory, and
> then at the CQL version of the same models, under
> <http://activefacts.rubyforge.org/svn/examples/>. You'll see a lot
> of what looks like unparsable plain English... but it's not. In
> fact, the language is much harder to parse than any of those examples;
> I don't have any published examples of queries (these are just
> declarations). The CQL parser is in
> <http://activefacts.rubyforge.org/svn/lib/activefacts/cql/> - you
> might be able to see where it's hard. You can be many tokens into
> recognising a fact_clause before realising you're looking at a
> "condition".
>
> It's possible that with your lookahead it would be possible though.
At one time I made a mini English parser, so this seems quite doable.
>> to see how to do it, whereas it won't ever happen with yours AFAICS.
> > It definitely could happen with mine. I do ruby code generation now,
> but
> > could have another Grammar-like class
>
> Cool! I didn't realize you were generating code from this.
>
> > But, if you use the #lookahead method on any
> > grammar, it will give a new grammar that handles the case when the
> grammar
> > fails in the middle (backtracks to the position where the grammar
> started).
>
> You mean it backtracks to the start of the file? Or to where you called
> lookahead?
Just back to where that lookahead Grammar started parsing. For other
Grammar's, there is only a one char/token lookahead. If it matches the
first char/token, it is committed to that and will give an exception if a
mismatch occurs later. When you apply lookahead to a Grammar, it starts
buffering the input and will rescue these exceptions. It simply rewinds the
input to the point where this lookahead Grammar started parsing when an
exception is found. That's all there is to it.
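A rough sketch of that rescue-and-rewind idea (the helper below is my own
illustration, not the actual Grammar#lookahead implementation):

# Wrap a parsing block so that a failure rewinds the input instead
# of propagating. Works with any IO-like object that supports #pos.
def lookahead(input)
  start = input.pos
  begin
    yield input            # run the wrapped grammar/rule
  rescue RuntimeError      # the rule failed somewhere in the middle
    input.pos = start      # rewind to where this lookahead started
    nil                    # report a clean mismatch to the caller
  end
end

# An alternation can then try one rule without poisoning the next:
# lookahead(io) { parse_object(io) } || parse_array(io)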
When a lexer is used, I don't think this backtracking is needed that often.
I guess that is why this memoization is so important with packrat/PEG
parsers - no lexer.
I have to think about whether memoization would apply or not. It may work
since my stuff carries around very little state (mainly just with regards to
the input and output streams).
On Feb 1, 2008 7:55 AM, Ruby Quiz <ja...@grayproductions.net> wrote:
> In honor of that, this week's Ruby Quiz is to write a parser for JSON.
Here is another solution of mine:
In this one, I just made a fast hand-built recursive-descent/LL(1) parser.
This is the kind of parser that I'm trying to get my 'grammar' package to
approach (using lots of optimizations). It uses no Regexp or ruby eval
(both of which are backed by compiled C for speed). And yet, it is the fastest
pure-ruby JSON parser we've seen (see the recursive descent line below):
ch/s author/gem
---- ----------
- Pawel Radecki (RE, mismatch)
3214 Justin Ethier (RE lexer + ruby eval, fixed number parsing)
4054 Eric Mahurin (Grammar0, no lexer, no parser generation)
4078 Eric I (Treetop, unicode broken)
6534 oksteev (Treetop, mismatches in benchmark)
8313 Clifford Heath (Treetop, had to remove handling of "\/")
17320 Alexander Stedile (RE)
54586 Eric Mahurin (Grammar, no lexer, v0.5)
137989 Paolo Bonzini (RE)
166041 Thomas Link (RE lexer + ruby eval, ruby 1.9 results)
220289 json
223486 Eric Mahurin (Grammar, no lexer, unreleased)
224823 fjson (uses C extensions)
333368 Thomas Link & Paolo Bonzini (RE + eval, unicode broken)
388670 Eric Mahurin (hand-built recursive descent)
553081 Eric Mahurin (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250 json (w/ C extensions)
#
# JSON hand-built recursive descent/LL(1) parser, by Eric Mahurin
#
require 'stringio'
class JSONParser
def parse(s)
@next = (@io=StringIO.new(s)).getc
ws
value(out=[])
ws
raise("EOF expected") if @next
raise(out.inspect) unless out.length==1
out[0]
end
def error(expected, found)
raise("expected #{expected}, found #{found ? ("'"<<found<<?\') :
'EOF'}")
end
def value(out)
if ?\[.equal?(@next)
# array
@next=@io.getc
ws
a = []
unless ?\].equal?(@next)
value(a)
ws
until ?\].equal?(@next)
?\,.equal?(@next) ? (@next=@io.getc) : error("','", @next)
ws
value(a)
ws
end
end
@next = @io.getc
out << a
elsif ?\{.equal?(@next)
# object
@next=@io.getc
ws
h = {}
unless ?\}.equal?(@next)
?\".equal?(@next) ? string(kv=[]) : error("a string", @next)
ws
?\:.equal?(@next) ? (@next=@io.getc) : error("':'", @next)
ws
value(kv)
ws
h[kv[0]] = kv[1]
until ?\}.equal?(@next)
?,.equal?(@next) ? (@next=@io.getc) : error("','", @next)
ws
?\".equal?(@next) ? string(kv.clear) : error("a string",
@next)
ws
?\:.equal?(@next) ? (@next=@io.getc) : error("':'", @next)
ws
value(kv)
ws
h[kv[0]] = kv[1]
end
end
@next = @io.getc
out << h
elsif (?a..?z)===(@next)
# keyword: true/false/null
(s="")<<@next
@next = @io.getc
while (?a..?z)===(@next)
s<<@next;@next=@io.getc
end
out << case s
when "true" then true
when "false" then false
when "null" then nil
else error("'true' or 'false' or 'null'", s)
end
elsif ?\".equal?(@next)
string(out)
else
# number
n = ""
(n<<@next;@next=@io.getc) if ?-.equal?(@next)
?0.equal?(@next) ? (n<<@next;@next=@io.getc) : digits(n)
(?..equal?(@next) ?
(n<<@next;@next=@io.getc;digits(n);exp(n);true) :
exp(n)) ?
(out << n.to_f) :
(out << n.to_i)
end
end
# Flattening any of the methods below will improve performance further
def ws
@next = @io.getc while (case @next;when ?\s,?\t,?\n,?\r;true;end)
end
def digits(out)
(?0..?9)===@next ? (out<<@next;@next=@io.getc) : error("a digit", @next)
while (?0..?9)===@next; (out<<@next;@next=@io.getc); end
true
end
def exp(out)
(case @next;when ?e,?E;true;end) ? (out<<@next;@next=@io.getc) : return
(out<<@next;@next=@io.getc) if (case @next;when ?-,?+;true;end)
digits(out)
end
def string(out)
# we've already verified the starting "
@next=@io.getc
s = ""
until ?\".equal?(@next)
if ?\\.equal?(@next)
@next = @io.getc
case @next
when ?\",?\\,?\/ then (s<<@next;@next=@io.getc)
when ?b then (s<<?\b;@next=@io.getc)
when ?f then (s<<?\f;@next=@io.getc)
when ?n then (s<<?\n;@next=@io.getc)
when ?r then (s<<?\r;@next=@io.getc)
when ?t then (s<<?\t;@next=@io.getc)
when ?u
@next = @io.getc
u = ""
4.times {
case @next
when ?0..?9, ?a..?f, ?A..?F
u<<@next;@next=@io.getc
else
error("a hex character", @next)
end
}
s << u.to_i(16)
else
error("a valid escape", @next)
end
else
error("a character", @next) unless @next
s<<@next;@next=@io.getc
end
end
@next = @io.getc
out << s
end
end
#!/usr/bin/env ruby -wKU
require "strscan"
# http://json.org/
class JSONParser
AST = Struct.new(:value)
def parse(input)
@input = StringScanner.new(input.strip)
parse_value.value
end
private
def parse_value
parse_object or
parse_array or
parse_string or
parse_number or
parse_keyword or
error("Illegal JSON value")
end
def parse_object
if @input.scan(/\{\s*/)
object = Hash.new
while key = parse_string
@input.scan(/\s*:\s*/) or error("Expecting object separator")
object[key.value] = parse_value.value
@input.scan(/\s*,\s*/) or break
end
@input.scan(/\s*\}\s*/) or error("Unclosed object")
AST.new(object)
else
false
end
end
def parse_array
if @input.scan(/\[\s*/)
array = Array.new
while contents = parse_value rescue nil
array << contents.value
@input.scan(/\s*,\s*/) or break
end
@input.scan(/\s*\]\s*/) or error("Unclosed array")
AST.new(array)
else
false
end
end
def parse_string
if @input.scan(/"/)
string = String.new
while contents = parse_string_content || parse_string_escape
string << contents.value
end
@input.scan(/"\s*/) or error("Unclosed string")
AST.new(string)
else
false
end
end
def parse_string_content
@input.scan(/[^\\"]+/) and AST.new(@input.matched)
end
def parse_string_escape
if @input.scan(%r{\\["\\/]})
AST.new(@input.matched[-1])
elsif @input.scan(/\\[bfnrt]/)
AST.new(eval(%Q{"#{@input.matched}"}))
elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
AST.new([Integer("0x#{@input.matched[2..-1]}")].pack("U"))
else
false
end
end
def parse_number
@input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+(?:[eE][+-]?\d+)?)?\b/) and
AST.new(eval(@input.matched))
end
def parse_keyword
@input.scan(/\b(?:true|false|null)\b/) and
AST.new(eval(@input.matched.sub("null", "nil")))
end
def error(message)
if @input.eos?
raise "Unexpected end of input."
else
raise "#{message}: #{@input.peek(@input.string.length)}"
end
end
end
__END__
James Edward Gray II
> Here's my own recursive descent parser (based on the not-quite-
> correct quiz tests):
This is a version I built some time ago when experimenting with peggy:
#!/usr/bin/env ruby -wKU
require "lib/parser"
require "lib/builder"
class JSONParser < Peggy::Builder
KEYWORDS = {"true" => true, "false" => false, "null" => nil}
ESCAPES = Hash[*%W[b \b f \f n \n r \r t \t]]
def self.parse(json_string)
parser = self.new
parser.source_text = json_string
parser.parse?(:value) or raise "Failed to parse: #{json_string.inspect}"
parser.to_ruby
end
def initialize
super
self.ignore_productions = [:space]
space { lit /\s+/ }
value {
seq {
opt { space }
one {
object
array
string
number
keyword
}
opt { space }
}
}
object {
seq {
lit /\{\s*/
one {
seq {
opt { many { seq { string; lit /\s*:/; value; lit /,\s*/ } } }
seq { string; lit /\s*:/; value }
lit "}"
}
lit "}"
}
}
}
array {
seq {
lit "["
one {
seq {
opt { many { seq { value; lit "," } } }; value; lit "]"
}
lit "]"
}
}
}
string {
seq {
lit '"'
one {
lit '"'
seq {
many {
one {
seq { string_content; opt { escape } }
seq { escape; opt { string_content } }
}
}
lit '"'
}
}
}
}
string_content { lit(/[^\\"]+/) }
escape {
one {
escape_literal
escape_sequence
escape_unicode
}
}
escape_literal { lit(%r{\\["\\/]}) }
escape_sequence { lit(/\\[bfnrt]/) }
escape_unicode { lit(/\\u[0-9a-f]{4}/i) }
number { lit(/-?(?:0|[1-9]\d*)(?:\.\d+(?:[eE][+-]?\d+)?)?\b/) }
keyword { lit(/\b(?:true|false|null)\b/) }
end
def to_ruby(from = parse_results.keys.min)
kind = parse_results[from][:found_order].first
to = parse_results[from][kind]
send("to_ruby_#{kind}", from, to)
end
private
def to_ruby_object(from, to)
object = Hash.new
skip_to = nil
last_key = nil
parse_results.keys.select { |k| k > from and k < to }.sort.each do |key|
content = parse_results[key]
next if skip_to and key < skip_to
next unless content[:found_order] and
( ( content[:found_order].size == 2 and
content[:found_order][1] == :value ) or
content[:found_order] == [:string] )
if content[:found_order] == [:string]
last_key = to_ruby_string(key, content[:string])
else
case content[:found_order].first
when :object
object[last_key] = to_ruby_object(key, content[:object])
skip_to = content[:object]
when :array
object[last_key] = to_ruby_array(key, content[:array])
skip_to = content[:array]
else
object[last_key] = to_ruby(key)
end
end
end
object
end
def to_ruby_array(from, to)
array = Array.new
skip_to = nil
parse_results.keys.select { |k| k > from and k < to }.sort.each do |key|
content = parse_results[key]
next if skip_to and key < skip_to
next unless content[:found_order] and
content[:found_order].size == 2 and
content[:found_order][1] == :value
case content[:found_order].first
when :object
array << to_ruby_object(key, content[:object])
skip_to = content[:object]
when :array
array << to_ruby_array(key, content[:array])
skip_to = content[:array]
else
array << to_ruby(key)
end
end
array
end
def to_ruby_string(from, to)
string = String.new
parse_results.keys.select { |k| k > from and k < to }.sort.each do |key|
content = parse_results[key]
next unless content[:found_order]
case content[:found_order].first
when :string_content
string << source_text[key...content[:string_content]]
when :escape_literal
string << source_text[content[:escape_literal] - 1, 1]
when :escape_sequence
string << ESCAPES[source_text[content[:escape_sequence] - 1, 1]]
when :escape_unicode
string << [Integer("0x#{source_text[key + 2, 4]}")].pack("U")
end
end
string
end
def to_ruby_number(from, to)
num = source_text[from...to]
num.include?(".") ? Float(num) : Integer(num)
end
def to_ruby_keyword(from, to)
KEYWORDS[source_text[from...to]]
end
end
__END__
I guess you can see why that library didn't win me over. :)
James Edward Gray II
> Here's my own recursive descent parser (based on the not-quite-
> correct quiz tests):
Finally, here's the JSON parser right out of Ghost Wheel's tests:
#!/usr/bin/env ruby -wKU
JSONParser = GhostWheel.build_parser( %q{
keyword = 'true' { true } | 'false' { false } | 'null' { nil }
number = /-?(?:0|[1-9]\d*)(?:\.\d+(?:[eE][+-]?\d+)?)?/
{ ast.include?(".") ? Float(ast) : Integer(ast) }
string_content = /\\\\["\\\\\/]/ { ast[-1, 1] }
| /\\\\[bfnrt]/
{ Hash[*%W[b \n f \f n \n r \r t \t]][ast[-1, 1]] }
| /\\\\u[0-9a-fA-F]{4}/
{ [Integer("0x#{ast[2..-1]}")].pack("U") }
| /[^\\\\"]+/
string = '"' string_content* '"' { ast.flatten[1..-2].join }
array_content = value_with_array_sep+ value
{ ast[0].inject([]) { |a, v| a.push(*v) } +
ast[-1..-1] }
| value? { ast.is_a?(EmptyParseResult) ? [] : [ast] }
array = /\[\s*/ array_content /\s*\]/ { ast[1] }
object_pair = string /\s*:\s*/ value { {ast[0] => ast[-1]} }
object_pair_and_sep = object_pair /\s*;\s*/ { ast[0] }
object_content = object_pair_and_sep+ object_pair
{ ast.flatten }
| object_pair?
{ ast.is_a?(EmptyParseResult) ? [] : [ast] }
object = /\\\{\s*/ object_content /\\\}\s*/
{ ast[1].inject({}) { |h, p| h.merge(p) } }
value_space = /\s*/
value_content = keyword | number | string | array | object
value = value_space value_content value_space
{ ast[1] }
value_with_array_sep = value /\s*,\s*/ { ast[0] }
json := value EOF { ast[0] }
} )
On Feb 4, 2008 7:29 PM, James Gray <ja...@grayproductions.net> wrote:
> Here's my own recursive descent parser (based on the not-quite-correct
> quiz tests):
>
Hi James,
I benchmarked your code. I also was curious how well a good hand-crafted
Regexp recursive descent parser (using the StringScanner) would compare to
my hand-crafted LL(1) (1 character lookahead) parser. So, I took your code
and applied some of the same optimizations that I used:
- minimize method calls (inline at least where a method is called once)
- minimize object creation (put results in the callers output buffer
instead of returning an AST)
- minimize exceptions
Here are the benchmark results with this and the other parsers you posted
(couldn't get the ghostwheel parser to work well):
ch/s author/gem
---- ----------
- Pawel Radecki (RE, couldn't parse benchmark JSON)
- ghostwheel (ghostwheel, couldn't parse benchmark JSON)
1226 James Edward Gray II (peggy, fails one test)
3214 Justin Ethier (RE lexer, ruby eval, fixed number parsing)
4054 Eric Mahurin (Grammar0, no lexer, no parser generation)
4078 Eric I (Treetop, unicode broken)
6534 oksteev (Treetop, mismatches in benchmark)
8313 Clifford Heath (Treetop, had to remove handling of "\/")
17320 Alexander Stedile (RE, recursive descent)
54586 Eric Mahurin (Grammar, no lexer, v0.5)
137989 Paolo Bonzini (RE, recursive descent)
166041 Thomas Link (RE lexer, ruby eval, ruby 1.9 results)
186042 James Edward Gray II (RE, recursive descent)
220289 json (pure ruby version)
223486 Eric Mahurin (Grammar, no lexer, unreleased)
224823 fjson (uses C extensions)
287292 James Edward Gray II (RE, recursive descent, Eric optimized)
333368 Thomas Link & Paolo Bonzini (RE + eval, unicode broken)
388670 Eric Mahurin (recursive descent)
553081 Eric Mahurin (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250 json (w/ C extensions)
I'd like to see a RACC parser for JSON.
Here is the optimized version of your recursive-descent Regexp/StringScanner
parser:
require "strscan"
class JSONParser
def parse(input)
@input = StringScanner.new(input.strip)
parse_value(out=[]) or error("Illegal JSON value")
out[0]
end
private
def parse_value(out)
if @input.scan(/\{\s*/)
object = {}
kv = []
while @input.scan(/"/)
parse_string(kv)
@input.scan(/\s*:\s*/) or error("Expecting object separator")
parse_value(kv) or error("Illegal JSON value")
object[kv[0]] = kv[1]
@input.scan(/\s*,\s*/) or break
kv.clear
end
@input.scan(/\s*\}\s*/) or error("Unclosed object")
out << object
elsif @input.scan(/\[\s*/)
array = []
while parse_value(array)
@input.scan(/\s*,\s*/) or break
end
@input.scan(/\s*\]\s*/) or error("Unclosed array")
out << array
elsif @input.scan(/"/)
parse_string(out)
elsif @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+(?:[eE][+-]?\d+)?)?\b/)
out << eval(@input.matched)
elsif @input.scan(/\b(?:true|false|null)\b/)
out << eval(@input.matched.sub("null", "nil"))
end
end
def parse_string(out)
string = ""
while true
if @input.scan(/[^\\"]+/)
string << @input.matched
elsif @input.scan(%r{\\["\\/]})
string << @input.matched[-1]
elsif @input.scan(/\\[bfnrt]/)
string << eval(%Q{"#{@input.matched}"})
elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
string << [Integer("0x#{@input.matched[2..-1]}")].pack("U")
else
break
end
end
@input.scan(/"\s*/) or error("Unclosed string")
out << string
end
def error(message)
if @input.eos?
raise "Unexpected end of input."
else
raise "#{message}: #{@input.peek(@input.string.length)}"
end
end
end
A few additional test cases:
assert_raise(RuntimeError) { @parser.parse(%{[], p "Foo"}) }
assert_raise(RuntimeError) { @parser.parse(%{""; p 123; "Foo"}) }
assert_raise(RuntimeError) { @parser.parse(%{"" p 123; ""}) }
From the original set:
assert_raises(RuntimeError) { @parser.parse("-5.-4") }
assert_raises(RuntimeError) { @parser.parse(%Q{{ "a" : 2, }}) }
assert_raise(RuntimeError) { @parser.parse(%q{true false}) }
My eval() based solution seems to fail on A Stedile's test case unless
the more strict rfc-rules, which allow only array and object at the
top level, are applied:
assert_raise(RuntimeError) { @parser.parse(%Q{"a" "b"}) }
Thomas.
I think it's just that ?<...> is slower than $n. Could you try
comparing my solution (the one based on yours) on both 1.8 and 1.9?
Thanks!
Paolo
That's true of course (i_18 is the $n version):
$ ruby benchmark_eric_mahurin.rb i_18
"quiz155i_18"
205653 chars/second
$ ruby19 benchmark_eric_mahurin.rb i_18
"quiz155i_18"
271050 chars/second
$ ruby19 benchmark_eric_mahurin.rb i
"quiz155i"
207100 chars/second
But just the pleasure of being able to use more than 9 groups and of
not having to care about group order IMHO justifies the use of named
groups. More than once I have wasted time searching for the source of
some problem just to find out that the group order had changed, or
that the group numbers differed between, well, only mostly similar
programmatically generated regexps.
But the reason why I'm asking is that, e.g., your solution finishes
the benchmark with 107323 chars/second when run with ruby18. With
ruby19, though, it's an astonishing 1703 chars/second. I'm not sure
what is causing this slowdown. I'm also wondering if this isn't an
artefact of the benchmark. The full run looks like this:
user system total real
...
8 24142 0.541000 0.000000 0.541000 ( 0.601000)
9 25988 0.621000 0.000000 0.621000 ( 0.641000)
10 588993 246.555000 93.324000 339.879000 (345.657000)
1703 chars/second
So the slowdown only happens after Step #9.
My version that uses Regexp#match() (but still eval based) achieves:
user system total real
...
7 3518 0.090000 0.000000 0.090000 ( 0.120000)
8 24142 2.894000 0.010000 2.904000 ( 2.934000)
8228 chars/second
But the time limit is exceeded at step #8 already. This makes me
wonder what is causing this "interesting" behaviour of your solution
at step #10 when run with ruby19.
BTW, my own benchmarks (using random object generation of different
lengths, based on Ara's proposal but slightly enhanced) show the
following picture:
First the net timing spent on random object generation alone:
user system total real
10 2.003000 0.000000 2.003000 ( 2.033000)
20 6.799000 0.000000 6.799000 ( 6.940000)
30 17.636000 0.000000 17.636000 ( 17.786000)
10: n=10000 avg.size=47
20: n=10000 avg.size=159
30: n=10000 avg.size=406
"quiz155i" (my original submission)
user system total real
10 3.495000 0.000000 3.495000 ( 3.535000)
20 10.024000 0.000000 10.024000 ( 10.095000)
30 25.988000 0.000000 25.988000 ( 26.027000)
10: n=10000 avg.size=46
20: n=10000 avg.size=148
30: n=10000 avg.size=391
"quiz155i_18" (modifyied for 18 compatibility)
user system total real
10 3.345000 0.000000 3.345000 ( 3.385000)
20 9.964000 0.000000 9.964000 ( 10.014000)
30 24.455000 0.000000 24.455000 ( 24.506000)
10: n=10000 avg.size=47
20: n=10000 avg.size=157
30: n=10000 avg.size=396
"quiz155b" (json based, i.e. ragel-based C extension)
user system total real
10 2.263000 0.010000 2.273000 ( 2.303000)
20 7.351000 0.000000 7.351000 ( 7.391000)
30 18.636000 0.000000 18.636000 ( 18.687000)
10: n=10000 avg.size=46
20: n=10000 avg.size=156
30: n=10000 avg.size=399
"solution_paolo_bonzini.rb"
user system total real
10 4.226000 0.000000 4.226000 ( 4.256000)
20 13.349000 0.000000 13.349000 ( 13.450000)
30 34.470000 0.070000 34.540000 ( 36.001000)
10: n=10000 avg.size=47
20: n=10000 avg.size=154
30: n=10000 avg.size=388
BTW I also measured steve's version of a treetop parser but with 1000
iterations only:
"solution_steve.rb"
user system total real
10 5.037000 0.000000 5.037000 ( 5.068000)
20 14.651000 0.010000 14.661000 ( 14.961000)
30 40.298000 0.020000 40.318000 ( 41.330000)
10: n=1000 avg.size=48
20: n=1000 avg.size=148
30: n=1000 avg.size=407
What would the size of an average json snippet an ajax app has to deal
with be? I'm not in the webapp development business but from my
understanding this would be rather small, wouldn't it?
Regards,
Thomas.
> That's true of course (i_18 is the $n version):
Uh-oh. Someone should write to ruby-core?
Paolo
On Feb 5, 2008 3:09 AM, tho_mica_l <mica...@gmail.com> wrote:
> Also I'm wondering if this isn't an artefact
> of
> the benchmark. The full run looks like this:
>
> user system total real
> ...
> 8 24142 0.541000 0.000000 0.541000 ( 0.601000)
> 9 25988 0.621000 0.000000 0.621000 ( 0.641000)
> 10 588993 246.555000 93.324000 339.879000 (345.657000)
> 1703 chars/second
>
My baseline assumption was that runtime was relatively linear with respect
to the data size. This assumption is broken with the above case (I think I
noticed this too at some point). Going from a depth of 9 to 10 increased
the length by ~20X, but the runtime went up by ~400X. There is obviously an
O(n*n) component in there (20*20=400). Sounds like there is a ruby 1.9 problem.
In the benchmark, you could move the print of the performance to inside the
loop, right before the break. If there is a consistent downward trend in
chars/second, you may have an O(n*n) solution and chars/second makes no
sense (for arbitrary data size). Otherwise, maybe we should be looking at
the best performance between the longest two data sizes so that there is no
penalty for a solution getting to a larger but possibly more difficult
dataset. Running this test multiple times (maybe with 4.times{} around the
whole benchmark - including creating the generator) would also be good.
> What would the size of an average json snippet an ajax app has to deal
> with be? I'm not in the webapp development business but from my
> understanding this would be rather small, wouldn't it?
Maybe, but then making a fast parser wouldn't be any fun :)
Since I ran my first preliminary benchmark I have been asking myself
how big the advantage of a C-based parser would actually be. So I
elaborated a little bit on this question. In order to also answer the
question of how your solutions "scale", I cleaned up my benchmarks a
little bit. The following includes all submissions that I could make
run with ruby19 -- for whatever reason. I don't have json for ruby18
installed, which is why I didn't run this test with ruby18.
The objects are generated before the test. The tests are run in a
tight loop, so the influence of the benchmarking code should be
rather marginal.
Objects were generated whose JSON representation adds up to about 2MB,
in 4 different chunk sizes ranging from about 45 to 900 bytes. The
object set is identical for all solutions, the numbers are
thus quite comparable. Since the figures differ slightly from Eric
Mahurin's benchmark it's possible that I did something wrong. But in
this case I did it equally wrong for all solutions. The code is down
below.
Regards,
Thomas.
Input chunks:
10: n=43475 avg.size=46.01 tot.size=2000236
20: n=12856 avg.size=155.61 tot.size=2000543
30: n=4897 avg.size=408.51 tot.size=2000483
40: n=2236 avg.size=894.47 tot.size=2000045
Ruby19 json
user system total real
10 2.274000 0.000000 2.274000 ( 2.294000)
20 1.402000 0.000000 1.402000 ( 1.432000)
30 1.041000 0.000000 1.041000 ( 1.061000)
40 1.282000 0.000000 1.282000 ( 1.302000)
10 871942 chars/sec (2000236/2.29)
20 1397027 chars/sec (2000543/1.43)
30 1885469 chars/sec (2000483/1.06)
40 1536132 chars/sec (2000045/1.30)
"solution_tml.rb"
user system total real
10 8.452000 0.010000 8.462000 ( 8.633000)
20 6.570000 0.000000 6.570000 ( 6.599000)
30 6.068000 0.000000 6.068000 ( 6.119000)
40 5.659000 0.000000 5.659000 ( 5.698000)
10 231696 chars/sec (2000236/8.63)
20 303158 chars/sec (2000543/6.60)
30 326929 chars/sec (2000483/6.12)
40 351008 chars/sec (2000045/5.70)
"solution_tml_pb.rb" (modified by P Bonzini)
user system total real
10 8.151000 0.000000 8.151000 ( 8.192000)
20 5.849000 0.000000 5.849000 ( 5.879000)
30 5.307000 0.000000 5.307000 ( 5.337000)
40 5.238000 0.000000 5.238000 ( 5.268000)
10 244169 chars/sec (2000236/8.19)
20 340286 chars/sec (2000543/5.88)
30 374832 chars/sec (2000483/5.34)
40 379659 chars/sec (2000045/5.27)
"solution_eric_i.rb"
user system total real
10 158.318000 0.040000 158.358000 (158.798000)
20 162.133000 0.030000 162.163000 (162.845000)
30 170.305000 0.030000 170.335000 (170.525000)
40 193.187000 0.070000 193.257000 (193.458000)
10 12596 chars/sec (2000236/158.80)
20 12284 chars/sec (2000543/162.85)
30 11731 chars/sec (2000483/170.53)
40 10338 chars/sec (2000045/193.46)
"solution_eric_mahurin3.rb"
user system total real
10 7.631000 0.000000 7.631000 ( 7.641000)
20 6.319000 0.000000 6.319000 ( 6.329000)
30 6.179000 0.000000 6.179000 ( 6.179000)
40 5.769000 0.000000 5.769000 ( 5.778000)
10 261776 chars/sec (2000236/7.64)
20 316091 chars/sec (2000543/6.33)
30 323755 chars/sec (2000483/6.18)
40 346148 chars/sec (2000045/5.78)
"solution_james_gray.rb"
user system total real
10 13.820000 0.000000 13.820000 ( 13.890000)
20 12.117000 0.000000 12.117000 ( 12.138000)
30 12.909000 0.000000 12.909000 ( 12.918000)
40 15.051000 0.010000 15.061000 ( 15.082000)
10 144005 chars/sec (2000236/13.89)
20 164816 chars/sec (2000543/12.14)
30 154860 chars/sec (2000483/12.92)
40 132611 chars/sec (2000045/15.08)
"solution_justin_ethier.rb"
user system total real
10 17.025000 0.000000 17.025000 ( 17.025000)
20 17.915000 0.040000 17.955000 ( 17.985000)
30 28.001000 0.021000 28.022000 ( 28.041000)
40 51.253000 0.070000 51.323000 ( 51.394000)
10 117488 chars/sec (2000236/17.03)
20 111233 chars/sec (2000543/17.98)
30 71341 chars/sec (2000483/28.04)
40 38915 chars/sec (2000045/51.39)
"solution_paolo_bonzini.rb"
user system total real
10 11.036000 0.000000 11.036000 ( 11.036000)
20 17.045000 0.030000 17.075000 ( 17.104000)
30 32.717000 0.020000 32.737000 ( 32.857000)
40 69.119000 0.070000 69.189000 ( 69.310000)
10 181246 chars/sec (2000236/11.04)
20 116963 chars/sec (2000543/17.10)
30 60884 chars/sec (2000483/32.86)
40 28856 chars/sec (2000045/69.31)
"solution_steve.rb"
user system total real
10 210.152000 0.040000 210.192000 (210.573000)
20 215.260000 0.060000 215.320000 (215.590000)
30 223.201000 0.110000 223.311000 (228.368000)
40 241.257000 0.260000 241.517000 (248.868000)
10 9499 chars/sec (2000236/210.57)
20 9279 chars/sec (2000543/215.59)
30 8759 chars/sec (2000483/228.37)
40 8036 chars/sec (2000045/248.87)
Benchmark code:
require 'benchmark'
# require 'json/pure'
require 'json'
N = 2000
S = [10, 20, 30, 40]
# This is a slightly enhanced version of Ara's object generator.
# Objects are generated via RandomObject.generate(nil, DEPTH)
# -- the first argument defines which object types are eligible
# and can be ignored in this context.
require 'tml/random-object'
puts 'Preparing objects ...'
sizes = Hash.new
objects = S.inject({}) do |h, s|
size = 0
a = h[s] = []
n = N * 1000
while size < n
o = RandomObject.generate(nil, s)
j = o.to_json
a << [o, j]
size += j.size
end
sizes[s] = size.to_f
h
end
throughput = Hash.new {|h, k| h[k] = Hash.new(0)}
ARGV.each do |arg|
p arg
require arg
parser = JSONParser.new
throughput = []
Benchmark.bm do |b|
S.each do |s|
t = b.report(s.to_s) do |sn|
objects[s].each do |o, j|
if o != parser.parse(j)
raise RuntimeError
end
end
end
throughput << "%s %d chars/sec (%d/%0.2f)" % [s,
sizes[s] / t.real, sizes[s], t.real]
end
end
puts
puts throughput.join("\n")
puts
puts
end
objects.each do |s, z|
puts "%s: n=%d avg.size=%0.2f tot.size=%d" %
[s, z.size, sizes[s].to_f / z.size, sizes[s]]
end
puts
On Feb 5, 2008 11:44 AM, tho_mica_l <mica...@gmail.com> wrote:
> > Maybe, but then making a fast parser wouldn't be any fun :)
>
> Since the figures differ slightly from Eric
> Mahurin's benchmark it's possible that I did something wrong. But in
> this case I did it equally wrong for all solutions. The code is down
> below.
We should probably assume all of these benchmarks have +-50%
error. The performance is highly data-set and phase-of-the-moon dependent.
You can still judge whether something has non-linear performance (i.e.
quadratic runtime) or judge whether one solution is 5-10X faster than
another. But, if two solutions are within 2X of each other in a benchmark,
I don't think there is a clear winner.
It does look like some solutions have quadratic runtime on ruby 1.9. I
didn't observe this on 1.8.6.
I added all of the unit tests I found in this thread, plus this one:
def test_int_parsing
assert_same(0, @parser.parse("0"))
assert_same(42, @parser.parse("42"))
assert_same(-13, @parser.parse("-13"))
end
and removed these that don't seem correct:
#assert_raise(RuntimeError) { @parser.parse(%{"\u0022; p 123;
\u0022Busted"}) }
#assert_equal("\\u0022; p 123; \u0022Busted",
# @parser.parse(%{"\\u0022; p 123; \\u0022Busted"}))
Here is a tally of failures (F) and errors (E) using this expanded unit test
suite:
ch/s F E author/gem
---- - - ----------
- 5 0 Pawel Radecki (RE, recursive descent)
- 6 2 ghostwheel (ghostwheel)
1226 3 2 James Edward Gray II (peggy)
3214 5 1 Justin Ethier (RE lexer, ruby eval, fixed numbers)
4054 0 0 Eric Mahurin (Grammar0, no lexer, no parser generation)
4078 2 0 Eric I (Treetop, unicode broken)
6534 2 0 Steve (Treetop, mismatches in benchmark)
8313 1 1 Clifford Heath (Treetop, removed handling of "\/")
17320 0 0 Alexander Stedile (RE, recursive descent)
54586 0 0 Eric Mahurin (Grammar, no lexer, v0.5)
137989 2 1 Paolo Bonzini (RE, recursive descent)
166041 2 1 Thomas Link (RE lexer, ruby eval, ruby 1.9 results)
186042 5 0 James Edward Gray II (RE, recursive descent)
220289 1 7* json
223486 0 0 Eric Mahurin (Grammar, no lexer, unreleased)
224823 6 0 fjson (uses C extensions)
287292 5 0 James Edward Gray II (RE, recursive, Eric optimized)
333368 3 0 Thomas Link & Paolo Bonzini (RE + eval, unicode broken)
388670 0 0 Eric Mahurin (recursive descent)
553081 4 9 Eric Mahurin (Grammar, no lexer, unreleased, ruby2cext)
1522250 0 7* json (w/ C extensions)
For the json gem, all of the failures happen because the tests are invalid -
top-level json should only be an array or an object.
My Grammar with ruby2cext didn't work well with unit testing because it
didn't handle creating the parser multiple times. Need to fix that.
Has anyone been able to benchmark the ghostwheel json parser? I would like
to see how well it does.
Here is the complete set of unit tests I used:
require "test/unit"
class TestJSONParser < Test::Unit::TestCase
def setup
@parser = JSONParser.new
end
def test_keyword_parsing
assert_equal(true, @parser.parse("true"))
assert_equal(false, @parser.parse("false"))
assert_equal(nil, @parser.parse("null"))
end
def test_number_parsing
assert_equal(42, @parser.parse("42"))
assert_equal(-13, @parser.parse("-13"))
assert_equal(3.1415, @parser.parse("3.1415"))
assert_equal(-0.01, @parser.parse("-0.01"))
assert_equal(0.2e1, @parser.parse("0.2e1"))
assert_equal(0.2e+1, @parser.parse("0.2e+1"))
assert_equal(0.2e-1, @parser.parse("0.2e-1"))
assert_equal(0.2E1, @parser.parse("0.2e1"))
end
def test_string_parsing
assert_equal(String.new, @parser.parse(%Q{""}))
assert_equal("JSON", @parser.parse(%Q{"JSON"}))
assert_equal( %Q{nested "quotes"},
@parser.parse('"nested \"quotes\""') )
assert_equal("\n", @parser.parse(%Q{"\\n"}))
assert_equal( "a",
@parser.parse(%Q{"\\u#{"%04X" % ?a}"}) )
end
def test_array_parsing
assert_equal(Array.new, @parser.parse(%Q{[]}))
assert_equal( ["JSON", 3.1415, true],
@parser.parse(%Q{["JSON", 3.1415, true]}) )
assert_equal([1, [2, [3]]], @parser.parse(%Q{[1, [2, [3]]]}))
end
def test_object_parsing
assert_equal(Hash.new, @parser.parse(%Q{{}}))
assert_equal( {"JSON" => 3.1415, "data" => true},
@parser.parse(%Q{{"JSON": 3.1415, "data": true}}) )
assert_equal( { "Array" => [1, 2, 3],
"Object" => {"nested" => "objects"} },
@parser.parse(<<-END_OBJECT) )
{"Array": [1, 2, 3], "Object": {"nested": "objects"}}
END_OBJECT
end
def test_parse_errors
assert_raise(RuntimeError) { @parser.parse("{") }
assert_raise(RuntimeError) { @parser.parse(%q{{"key": true false}}) }
assert_raise(RuntimeError) { @parser.parse("[") }
assert_raise(RuntimeError) { @parser.parse("[1,,2]") }
assert_raise(RuntimeError) { @parser.parse(%Q{"}) }
assert_raise(RuntimeError) { @parser.parse(%Q{"\\i"}) }
assert_raise(RuntimeError) { @parser.parse("$1,000") }
assert_raise(RuntimeError) { @parser.parse("1_000") }
assert_raise(RuntimeError) { @parser.parse("1K") }
assert_raise(RuntimeError) { @parser.parse("unknown") }
end
def test_int_parsing
assert_same(0, @parser.parse("0"))
assert_same(42, @parser.parse("42"))
assert_same(-13, @parser.parse("-13"))
end
def test_more_numbers
assert_equal(5, @parser.parse("5"))
assert_equal(-5, @parser.parse("-5"))
assert_equal 45.33, @parser.parse("45.33")
assert_equal 0.33, @parser.parse("0.33")
assert_equal 0.0, @parser.parse("0.0")
assert_equal 0, @parser.parse("0")
assert_raises(RuntimeError) { @parser.parse("-5.-4") }
assert_raises(RuntimeError) { @parser.parse("01234") }
assert_equal(0.2e1, @parser.parse("0.2E1"))
assert_equal(42e10, @parser.parse("42E10"))
end
def test_more_string
assert_equal("abc\befg", @parser.parse(%Q{"abc\\befg"}))
assert_equal("abc\nefg", @parser.parse(%Q{"abc\\nefg"}))
assert_equal("abc\refg", @parser.parse(%Q{"abc\\refg"}))
assert_equal("abc\fefg", @parser.parse(%Q{"abc\\fefg"}))
assert_equal("abc\tefg", @parser.parse(%Q{"abc\\tefg"}))
assert_equal("abc\\efg", @parser.parse(%Q{"abc\\\\efg"}))
assert_equal("abc/efg", @parser.parse(%Q{"abc\\/efg"}))
end
def test_more_object_parsing
assert_equal({'a'=>2,'b'=>4}, @parser.parse(%Q{{ "a" : 2 , "b":4 }}))
assert_raises(RuntimeError) { @parser.parse(%Q{{ "a" : 2, }}) }
assert_raises(RuntimeError) { @parser.parse(%Q{[ "a" , 2, ]}) }
end
def test_alexander
assert_raise(RuntimeError) { @parser.parse(%Q{"a" "b"}) }
end
def test_thomas
assert_raise(RuntimeError) { @parser.parse(%{p "Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{[], p "Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{[p "Busted"]}) }
assert_raise(RuntimeError) { @parser.parse(%{{1 => STDOUT.puts("Busted")}}) }
#assert_raise(RuntimeError) { @parser.parse(%{"\u0022; p 123;
\u0022Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{"" p 123; ""}) }
#assert_equal("\\u0022; p 123; \u0022Busted",
# @parser.parse(%{"\\u0022; p 123; \\u0022Busted"}))
assert_equal('#{p 123}', @parser.parse(%q{"#{p 123}"}))
assert_equal(['#{`ls -r`}'], @parser.parse(%q{["#{`ls -r`}"]}))
assert_equal('#{p 123}', @parser.parse(%q{"\\u0023{p 123}"}))
assert_equal('#{p 123}', @parser.parse(%q{"\u0023{p 123}"}))
end
def test_thomas2
assert_raise(RuntimeError) { @parser.parse(%{[], p "Foo"}) }
assert_raise(RuntimeError) { @parser.parse(%{""; p 123; "Foo"}) }
assert_raise(RuntimeError) { @parser.parse(%{"" p 123; ""}) }
assert_raises(RuntimeError) { @parser.parse("-5.-4") }
assert_raises(RuntimeError) { @parser.parse(%Q{{ "a" : 2, }}) }
assert_raise(RuntimeError) { @parser.parse(%q{true false}) }
end
end
> Has anyone been able to benchmark the ghostwheel json parser?
Sorry about that. GhostWheel doesn't need to instantiate the parser
before calling parse(). Drop the .new in your setup and it will work.
James Edward Gray II
Yes, it's a simple concept. I give a little comparison of parsing
techniques which has the flavour of how LR parsing works in my
presentation on Treetop, <http://dataconstellation.com/blog>.
You'll need to grab the audio though.
You can add packrat behaviour to any recursive-descent LL parser with
backtracking simply by including, at the start of each rule, code
that says:
return m if (m = memo[location, :rule_name])
start_location = location
and before any other return statement,
return memo[start_location, :rule_name] = parse_result
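Fleshed out in Ruby, a memoized rule might look like the following sketch
(my own illustration; one detail the fragment above leaves implicit is that
a memo hit must also restore the input position where the memoized parse
originally ended):

def parse_array
  # @memo is assumed to be a Hash set up before parsing starts.
  if (m = @memo[[@input.pos, :array]])
    @input.pos = m[:end_pos]   # skip what was already parsed here
    return m[:result]
  end
  start = @input.pos
  result = nil # ... normal recursive-descent body for this rule ...
  @memo[[start, :array]] = { :result => result, :end_pos => @input.pos }
  result
end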
I'd love to see you bend your considerable mind towards working
out how to minimise the memoization in a packrat parser. The sheer
number of objects created is clearly the performance-limiting
factor.
Clifford Heath.
I figured that part out, but there were some significant bugs in the
ghostwheel grammar spec:
* was using ";" instead of "," between key:value pairs
* \b was converted to \n
* not handling number with exponent and fractional part
I fixed those, but it still has a bug where it sometimes spits out arrays
where an object/hash should be. I just skipped the self-checking in my
benchmark to get some results.
I added a new column to the results to show how much coding is needed for
the parser. I stripped out comments and leading whitespace and measured
gzip size. For the parser generators, I only measured the parser spec (i.e.
treetop file) size. Here are the results:
ch/s F E gzip author/gem
------- - - ---- ----------
- 5 0 545 Pawel Radecki (RE, recursive descent)
1226 3 2 1074 James Edward Gray II (peggy)
3214 5 1 683 Justin Ethier (RE lexer, ruby eval, fixes)
4054 0 0 608 Eric Mahurin (Grammar0, no lexer, no code-gen)
4076 6 2 588 ghostwheel (ghostwheel, fixes)
4078 2 0 706 Eric I (Treetop, unicode broken)
6534 2 0 631 Steve (Treetop, mismatches in benchmark)
8313 1 1 545 Clifford Heath (Treetop, removed handling of "\/")
17320 0 0 842 Alexander Stedile (RE, recursive descent)
54586 0 0 723 Eric Mahurin (Grammar, no lexer, v0.5)
137989 2 1 660 Paolo Bonzini (RE, recursive descent)
166041 2 1 445 Thomas Link (RE lexer, ruby eval, ruby 1.9 results)
186042 5 0 685 James Edward Gray II (RE, recursive descent)
220289 1 0 - json
223486 0 0 653 Eric Mahurin (Grammar, no lexer, unreleased)
224823 6 0 - fjson (uses C extensions)
287292 5 0 606 James Edward Gray II (RE, recursive, Eric optimized)
333368 3 0 405 Thomas Link & Paolo Bonzini (RE + eval)
388670 0 0 827 Eric Mahurin (recursive descent)
553081 4 9 653 Eric Mahurin (Grammar, no lexer, unreleased, ruby2cext)
1522250 0 0 - json (w/ C extensions)
Notice that there isn't much advantage to using a parser generator in this
case. Since JSON is relatively simple, you don't save much (or any) coding
by using a parser generator.
> I figured that part out, but there were some significant bugs in the
> ghostwheel grammar spec:
> * was using ";" instead of "," between key:value pairs
> * \b was converted to \n
> * not handling number with exponent and fractional part
>
> I fixed those, but it still has a bug where it sometimes spits out
> arrays
> where an object/hash should be. I'm just skipped the self-checking
> in my
> benchmark to get some results.
Oops. ;)
Thanks for figuring most of it out.
> Notice that there isn't much advantage to use a parser generator in
> this
> case. Since JSON is relatively simple, you don't save much (or any)
> coding by using a parser generator.
Right. One of the main points of recursive descent parsing is that it
is supposed to be easy to hand roll. For many cases, you really don't
need the parser generator to hold your hand. Of course, Treetop adds
the nice memoization and whatnot. Tradeoffs.
James Edward Gray II
A point I would like to make based on these benchmarks is that you can build
a very efficient parser (or lexer) without using Regexp. The fastest (on my
machine and my benchmark) pure-ruby solution doesn't use a single Regexp -
it just is a straight single character lookahead recursive descent parser.
I think the benefit of Regexp goes down the more of them you are using.
Not using Regexp also gives a lot more flexibility. A Regexp unfortunately
only operates on a String. This makes it very difficult to deal with files
when a given Regexp may cover an arbitrary number of lines (like when
parsing a string or a multi-line comment). With Regexp, you find it easiest
and safest to read the entire file/stream into a String first. IMHO, a
"real" parser should not need to read the entire file into memory.
Eric
Agreed. It should however be possible to modify a Regexp engine to
work on an IO stream, reading more input only as required. Because
of the lookahead required, it is necessary to be able to wind back
excessive input that has been read.
When I first started writing my parser generator, I wanted it to be able to
handle a regex at the leaf. I tried putting some layer on top, then I
complained about the inflexibility of regex. Finally I gave up and decided
not to use them. I think I ended up with something much cleaner since a
regex is already a mini-parser. Using two parsing languages (low-level
regexp and high-level BNF-like thing) just wasn't as appealing anymore.
Fortunately, I found out much later that this decision didn't really cost
anything in terms of performance.
But yes, regex should be changed to be more flexible. In my opinion, it
should be duck-typed to work on any String-like object (using a subset of
string methods - #[] would be minimal). It's not too difficult to wrap an IO
to look like a read-only String. It would of course optimize the real
String case, but that shouldn't affect the functionality.
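A crude sketch of such a wrapper (my own illustration, ignoring
performance; the class name and chunk size are arbitrary):

require 'stringio'  # any IO works; StringIO keeps the example self-contained

# Wraps an IO so it quacks like a read-only String: #[] lazily pulls
# more data from the stream into an internal buffer as needed.
class IOString
  def initialize(io)
    @io  = io
    @buf = ""
  end

  # Supports the s[index, length] form a parser/regexp engine would use.
  def [](index, len = 1)
    fill(index + len)
    @buf[index, len]
  end

  private

  def fill(upto)
    while @buf.length < upto
      chunk = @io.read(4096) or break
      @buf << chunk
    end
  end
end

s = IOString.new(StringIO.new('{"key": 42}'))
s[0, 1]   # => "{" (read on demand)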
Even C++ has "duck-typed" a regex:
http://www.boost.org/libs/regex/doc/index.html
This works on arbitrary characters and arbitrary (external) iterators (into
containers). Of course it uses the ugly template syntax to map to static
types.
I submit that if your code is as fast as a Ruby Regexp,
that's because Regexp isn't as fast as it should be :-).
They are a convenient solution for a wide range of problems.
With respect to extending ruby's regexp, I'd rather be interested in
embedded code snippets that are evaluated when a clause matches. I
didn't know perl (http://perldoc.perl.org/perlre.html) does this until
I read about it in the ragel manual. It seems there are two kinds of
embedded code: (1) evaluate the code and match with null width; (2)
use the return value of the code as regexp. This sounds quite
interesting to me. It's currently listed under "A-3. Lacked features
compare with perl 5.8.0" in the ruby RE manual.
Regards,
Thomas.
On Feb 6, 2008 2:05 AM, tho_mica_l <mica...@gmail.com> wrote:
> > I think the benefit of Regexp goes down the more of them you are using.
>
> They are a convenient solution for a wide range of problems.
You are taking my statement out of context. I was referring to
performance. If you are using a bunch of little regex's, you won't see much
performance benefit compared to matching a character at a time with a 1
character lookahead (LL(1) parsing). I think you can write a pure ruby
LL(1) lexer/parser that usually meets or beats a regex version.
But yes, regexes are convenient for a wide range of problems.
> With respect to extending ruby's regexp, I'd rather be interested in
> embedded code snippets that are evaluated when a clause matches. I
> didn't know perl (http://perldoc.perl.org/perlre.html) does this until
> I read about it in the ragel manual. It seems there are two kinds of
> embedded code: (1) evaluate the code and match with null width; (2)
> use the return value of the code as regexp. This sound quite
> interesting to me. It's currently listed under "A-3. Lacked features
> compare with perl 5.8.0" in the ruby RE manual.
>
Yep. I used this in perl 5+ years ago (before coming to ruby) to write
recursive regex's (i.e. paren matching).
But, this doesn't address the issue I was referring to - applying a Regexp
to something other than a String.
> ch/s F E gzip author/gem
> ------- - - ---- ----------
> 186042 5 0 685 James Edward Gray II (RE, recursive descent)
Here's a revised version of this parser that passes all of the tests:
#!/usr/bin/env ruby -wKU
require "strscan"
class JSONParser
AST = Struct.new(:value)
def parse(input)
@input = StringScanner.new(input.strip)
parse_value.value
ensure
@input.eos? or error("Unexpected data")
end
private
def parse_value
trim_space
parse_object or
parse_array or
parse_string or
parse_number or
parse_keyword or
error("Illegal JSON value")
ensure
trim_space
end
def parse_object
if @input.scan(/\{\s*/)
object = Hash.new
more_pairs = false
while key = parse_string
@input.scan(/\s*:\s*/) or error("Expecting object separator")
object[key.value] = parse_value.value
more_pairs = @input.scan(/\s*,\s*/) or break
end
error("Missing object pair") if more_pairs
@input.scan(/\s*\}\s*/) or error("Unclosed object")
AST.new(object)
else
false
end
end
def parse_array
if @input.scan(/\[\s*/)
array = Array.new
more_values = false
while contents = parse_value rescue nil
array << contents.value
more_values = @input.scan(/\s*,\s*/) or break
end
error("Missing value") if more_values
@input.scan(/\s*\]\s*/) or error("Unclosed array")
AST.new(array)
else
false
end
end
def parse_string
if @input.scan(/"/)
string = String.new
while contents = parse_string_content || parse_string_escape
string << contents.value
end
@input.scan(/"\s*/) or error("Unclosed string")
AST.new(string)
else
false
end
end
def parse_string_content
@input.scan(/[^\\"]+/) and AST.new(@input.matched)
end
def parse_string_escape
if @input.scan(%r{\\["\\/]})
AST.new(@input.matched[-1])
elsif @input.scan(/\\[bfnrt]/)
AST.new(eval(%Q{"#{@input.matched}"}))
elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
AST.new([Integer("0x#{@input.matched[2..-1]}")].pack("U"))
else
false
end
end
def parse_number
@input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\b/) and
AST.new(eval(@input.matched))
end
def parse_keyword
@input.scan(/\b(?:true|false|null)\b/) and
AST.new(eval(@input.matched.sub("null", "nil")))
end
def trim_space
@input.scan(/\s+/)
end
def error(message)
if @input.eos?
raise "Unexpected end of input."
else
raise "#{message}: #{@input.peek(@input.string.length)}"
end
end
end
__END__
James Edward Gray II
require 'stringio'
class JSONParser
def parse(str)
parse_next(StringIO.new(str))
end
private
def parse_next(strio)
strio.strip!
if el=number(strio)
return el
elsif el=string(strio)
return el
end
{"true"=>true,"false"=>false,"null"=>nil}.each_pair do |k,v|
if strio.sees? k
strio.read k.length
return v
end
end
if strio.sees?("{")
obj = Hash.new
strio.getc
until strio.strip!.sees?("}")
key = string(strio) or raise
strio.strip!
raise unless strio.read(1) == ':'
val = parse_next(strio)
obj[key] = val
strio.strip!
raise unless strio.sees?('}') or strio.read(1) == ','
end
strio.getc
obj
elsif strio.read(1) == '['
arr = Array.new
until strio.strip!.sees?("]")
arr << parse_next(strio)
raise unless strio.sees?("]") or strio.read(1) == ','
end
strio.getc
arr
else
raise
end
end
def string(strio)
str=strio.read(/\"([^\"\\]|\\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[a-fA-F0-9]{4})*\"/).
to_s.gsub(/\\u([a-fA-F0-9]{4})/){$1.to_i(16).chr}[1...-1] \
or return nil
str.gsub!(/\\[\"\\\/bfnrt]/){|s|eval("\"#{s}\"")}
str
end
def number(strio)
eval (strio.read(/-?(0|[1-9]\d*)(\.\d*)?([eE][+-]?\d+)?/).to_s)
end
end
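This solution relies on three StringIO extensions - sees?, a chainable
strip!, and a read that accepts a Regexp - which are not part of the
standard library and are not shown above. The following is a guess at
minimal equivalents (purely a reconstruction, not the author's code):

require 'stringio'

class StringIO
  # True if the unread remainder starts with the given literal string.
  def sees?(str)
    string[pos, str.length] == str
  end

  # Skip leading whitespace in place; returns self so calls can chain.
  def strip!
    self.pos += 1 while string[pos, 1] =~ /\s/
    self
  end

  # Accept a Regexp: consume and return a match anchored at the current
  # position (or nil). Fall back to the normal byte-count read otherwise.
  alias_method :read_bytes, :read
  def read(arg = nil)
    return read_bytes(arg) unless arg.is_a?(Regexp)
    anchored = Regexp.new("\\A(?:#{arg.source})", arg.options)
    if m = anchored.match(string[pos..-1])
      self.pos += m[0].length
      m[0]
    end
  end
end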
On Feb 6, 2008 9:42 AM, James Gray <ja...@grayproductions.net> wrote:
> Here's a revised version of this parser that passes all of the tests:
>
> #!/usr/bin/env ruby -wKU
>
Hi James,
I applied these fixes to my optimization of your StringScanner recursive
descent parser and applied a few more. I also got it working in ruby1.9.
It is at the end of this message. I couldn't figure out why yours wasn't
working with 1.9.
Also, I did some more benchmarking. This time I ran 4 passes for each
dataset and I picked the best bandwidth (ch/s) among the 2 longest running
datasets. I limited the depth to 10 because only the fastest 2 parsers need
anything deeper (to keep a parse from taking under 1s), and ruby seems to do
strange things with garbage collection. All of the fastest parsers are shown
with results from the
depth=10 dataset.
I ran everything I could easily on 1.9 (didn't load any gems for 1.9).
Here are the results with this benchmarking that I think is more reliable:
1.8 1.9 F E gzip author/gem
------ --- - - ---- ----------
- - 5 0 545 Pawel Radecki (RE, recursive descent)
NL - 6 0 796 James Koppel (RE+StringIO, recursive descent)
NL NL 5 1 683 Justin Ethier (RE lexer, ruby eval, fixes)
NL - 3 2 1074 James Edward Gray II (peggy)
4662 SF 0 0 608 Eric Mahurin (Grammar0, no code-gen)
5264 - 6 2 588 ghostwheel (ghostwheel, fixes)
5764 - 2 0 706 Eric I (Treetop, unicode broken)
8246 - 2 0 631 Steve (Treetop, mismatches in benchmark)
11856 - 1 1 545 Clifford Heath (Treetop)
25841 - 0 0 842 Alexander Stedile (RE, recursive descent)
57217 - 0 0 723 Eric Mahurin (Grammar, v0.5)
152205 NL 2 1 660 Paolo Bonzini (RE, recursive descent)
- 191529 2 1 445 Thomas Link (RE lexer, ruby eval)
202090 - 0 0 774 James Edward Gray II (RE, recursive descent)
233646 - 0 0 653 Eric Mahurin (Grammar, unreleased)
270508 211266 1 0 - json
299259 - 6 0 - fjson (w/ C ext)
351163 235726 0 0 607 James & Eric M (RE, recursive)
359665 279539 3 0 405 Thomas & Paolo (RE + eval)
430251 106285 0 0 837 Eric Mahurin (LL(1), recursive descent)
2079190 - 0 0 653 Eric Mahurin (Grammar, unreleased, ruby2cext)
8534416 3217453 0 0 - json (w/ C ext)
NL: non-linear, SF: seg-fault
Here are some things to note:
* on closer inspection, some of the slower solutions are nonlinear (usually
quadratic runtime). ch/s is not meaningful since it just gets worse with
larger data sets.
* ruby1.9 killed my LL(1) parser performance. The reason for this is that
characters in 1.9 are full objects, where in 1.8 they are immediate
Fixnum's. 4X slower is not good. I need to go ask if we can get some
immediate 1-character Strings in ruby 1.9 to solve this. Anything that
worked with characters in 1.8 is going to see the same problem (see the
snippet after this list). My Grammar classes generate similar LL(1)
parsers and will be affected in the same way. It works so well in 1.8
because there is so little object creation (using Regexp's creates more
objects).
* Dominik Bathon's ruby2cext is giving my dev Grammar class a 10X
performance boost. Woohoo! Only 4X away from the best hand-crafted C (not
bad considering the ruby and then the C were autogenerated). I generate
low-level stuff that optimizes well. High-level Regexp's will have little
benefit from ruby2cext.
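For anyone who hasn't run into it, here is a quick illustration of the
character-handling change (the comments show what each expression
evaluates to on each version):

# Ruby 1.8: character literals and String#[] yield Fixnums,
# which are immediate values - no per-character allocation.
?a        # => 97
"abc"[0]  # => 97

# Ruby 1.9: the same expressions yield one-character Strings,
# a freshly allocated object for every character examined.
?a        # => "a"
"abc"[0]  # => "a"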
Here is the benchmark code I used (along with the previously posted
RandomJSON class):
parser = JSONParser.new
generator = RandomJSON.new((ARGV[1]||0).to_i)
bandwidth = 0
bandwidth0 = 0
t0 = 0
l = nil
t = nil
max_depth = 10
(max_depth+1).times { |depth|
tree = generator.value(depth, depth%2)
s = tree.inspect
s.gsub!(/=>/, ':')
s.gsub!(/nil/, 'null')
tree2 = nil
l = s.length
t = nil
4.times {
Benchmark.bm { |b|
GC.start
t1 = b.report("#{depth} #{l} ") { tree2 = parser.parse(s) }
GC.start
raise("#{tree2.inspect}!=#{tree.inspect}") if tree2!=tree
GC.start
if (!t or t1.real<t.real)
t = t1
end
}
}
bandwidth = l/t.real
puts "#{bandwidth.to_i} chars/second"
break if (t.real>=(ARGV[0]||1).to_f or depth>=max_depth)
if (t.real>t0)
bandwidth0 = bandwidth
t0 = t.real
end
}
bandwidth = bandwidth0 if (bandwidth0>bandwidth)
puts "\n#{bandwidth.to_i} chars/second"
Here is another optimization of James' solution:
require "strscan"
class JSONParser
def parse(input)
@input = StringScanner.new(input)
@input.scan(/\s*/)
parse_value(out=[])
@input.eos? or error("Unexpected data")
out[0]
end
private
def parse_value(out)
if @input.scan(/\{\s*/)
object = {}
kv = []
until @input.scan(/\}\s*/)
object.empty? or @input.scan(/,\s*/) or error("Expected ,")
kv.clear
@input.scan(/"/) or error("Expected string")
parse_string(kv)
@input.scan(/:\s*/) or error("Expecting object separator")
parse_value(kv)
object[kv[0]] = kv[1]
end
out << object
elsif @input.scan(/\[\s*/)
array = []
until @input.scan(/\]\s*/)
array.empty? or @input.scan(/,\s*/) or error("Expected ,")
parse_value(array)
end
out << array
elsif @input.scan(/"/)
parse_string(out)
elsif @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\s*/)
out << eval(@input.matched)
elsif @input.scan(/true\s*/)
out << true
elsif @input.scan(/false\s*/)
out << false
elsif @input.scan(/null\s*/)
out << nil
else
error("Illegal JSON value")
end
end
def parse_string(out)
string = ""
while true
if @input.scan(/[^\\"]+/)
string.concat(@input.matched)
elsif @input.scan(%r{\\["\\/]})
string << @input.matched[-1]
elsif @input.scan(/\\[bfnrt]/)
string << eval(%Q{"#{@input.matched}"})
elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
string << @input.matched[2..-1].to_i(16)
else
break
end
end
@input.scan(/"\s*/) or error("Unclosed string")
out << string
end
def error(message)
if @input.eos?
raise "Unexpected end of input."
else
raise "#{message}: #{@input.peek(@input.string.length)}"
end
end
end
> - 191529 2 1 445 Thomas Link (RE lexer, ruby eval)
> 359665 279539 3 0 405 Thomas & Paolo (RE + eval)
May I ask which tests are failing here, since it passes all of my tests.
> 2079190 - 0 0 653 Eric Mahurin (Grammar, unreleased, ruby2cext)
This made me quite curious. Unfortunately it seems ruby2cext doesn't
support ruby19 yet. Anyway, will this upcoming version of grammar be
a drop-in replacement/update for grammar 0.5, i.e. is it safe to use
grammar 0.5 now?
BTW I took Paolo's hacked version of my humble submission and
converted it for use with StringScanner, which it seems has better
performance. Since Paolo uses numbered groups, StringScanner is fine. I
slightly modified the regexp to catch something like '"a" "b"'.
Regards,
Thomas.
require 'strscan'

class JSONParser
  def initialize
    @rxe = /
      \[|\]|
      \{|\}|
      (:)|
      (,[ \t\r\n]*[}\]])|
      ,|
      ("(?>[^"\\]+|\\(?:u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*"(?![ \t\r\n]"))|
      -?(?=\d)(?>0|[1-9]\d*)(?>\.\d+)?(?>[Ee][+-]?\d+)?(?!\d)|
      true|
      false|
      (null)|
      (?>[ \t\r\n]+)|
      ((?>.+))
    /xmu
  end

  def parse(json)
    scanner = StringScanner.new(json)
    out = []
    until (scanner.skip(/[[:space:][:cntrl:]]*/); scanner.eos?)
      scan = scanner.scan(@rxe)
      if scanner[5] or scanner[2]
        invalid(scanner[2] || scanner[5])
      elsif scanner[1]
        out << '=>'
      elsif scanner[3]
        out << scanner[3].gsub(/#/, '\\\\#')
      elsif scanner[4]
        out << 'nil'
      else
        out << scan
      end
    end
    begin
      return eval(out.join(' '))
    rescue Exception => e
      invalid(json)
    end
  end

  def invalid(string)
    raise RuntimeError, 'Invalid JSON: %s' % string
  end
end
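The trick, in short: the lexer rewrites the JSON token stream into Ruby source
(":" becomes "=>", "null" becomes "nil") and eval()s the result. A quick
check, assuming the class above:

parser = JSONParser.new
p parser.parse('{"JSON": [3.1415, null, true]}')
# => {"JSON"=>[3.1415, nil, true]}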
I'm not going to show a grammar-based solution here, not because I don't like
them, but because I want to show a few of the ideas behind simple parsing. To
do that, we will need to examine a hand-rolled solution. Just to be clear
though, using grammar-based parsers can often be a more robust choice if you
can find an official grammar for the content you need to parse.
Eric Mahurin sent in a hand-rolled solution that has quite a few advantages.
First, it is trivial to adapt so that the entire content to be parsed doesn't
need to be read into memory. It's also some very efficient code.
Unfortunately, that makes it a touch less obvious if you aren't already familiar
with parsing techniques.
Given that, I'm going to show my own hand-rolled recursive descent parser. It's
not as cool as Eric's but it's a little easier to take in. We will say it's a
good introduction to Eric's code, which you should be able to follow after I
break this down.
Here's the beginning of my parser:
#!/usr/bin/env ruby -wKU

require "strscan"

class JSONParser
  AST = Struct.new(:value)

  def parse(input)
    @input = StringScanner.new(input)
    parse_value.value
  ensure
    @input.eos? or error("Unexpected data")
  end

  private

  # ...
One of the first concerns when parsing is the need to manage where you currently
are in the input. If you treat the input as an IO object, you can read input
piece by piece and the IO object itself naturally keeps track of where you are.
For String input though, it's often easier to use Ruby's standard StringScanner
class. It wraps the input and allows you to test regular expression matches
against the head of that input. It will tell you when they match or don't and
when they do, your position automatically advances forward beyond that match.
You can see that I set this up in the code above.
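If you haven't used StringScanner before, a tiny session shows the behavior
just described (illustrative values only):

require "strscan"

s = StringScanner.new("  42 rest")
s.scan(/\s+/)  # => "  "    (match: position advances past the whitespace)
s.scan(/\d+/)  # => "42"    (match: position advances again)
s.scan(/\d+/)  # => nil     (no match at the head: position stays put)
s.rest         # => " rest" (what's left to parse)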
The only public method for my class is parse(). It prepares the StringScanner
as I just described, tries to parse a JSON value out of the input, then makes
sure we consumed all of the input. Note that my use of ensure here isn't very
standard. I'm just using it to run some code at the end of the method without
changing the return value of the method.
The AST definition also merits a bit of discussion. It would have been nice to
just have each method return the Ruby objects for the JSON it parsed. However,
false and nil (null in JSON) are legal JSON parses in some places. If I return
those, I won't be able to tell if my parse succeeded or failed. To get around
that, all parsed JSON values are wrapped in a trivial abstract syntax tree
object. Returning this object is always true and, after I've verified that the
parse worked, it's just one more method call to retrieve the actual value it
parsed.
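To make that concrete, here's the difference the wrapper makes (a small
illustration):

AST = Struct.new(:value)

node = AST.new(false)     # a *successful* parse of JSON's false
puts "parsed!" if node    # the wrapper object itself is always true
node.value                # => false, the actual Ruby value parsed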
It's worth mentioning that this parser is based on the not quite correct
definition of JSON I put forth in the quiz tests. Only objects and arrays are
allowed to be top-level JSON values. An easy fix is to replace this line
    # ...
    parse_value.value
    # ...
with:
    # ...
    if top_level = parse_object || parse_array
      top_level.value
    else
      error("Illegal top-level JSON object")
    end
    # ...
Now let's look at the main parser:
  # ...
  def parse_value
    trim_space
    parse_object or
      parse_array or
      parse_string or
      parse_number or
      parse_keyword or
      error("Illegal JSON value")
  ensure
    trim_space
  end
  # ...
# ...
This method really sums up the basic strategy of recursive descent parsing. At
each point, try to read out one of the legal values that can occur there. You
can do that just by drilling down into more specialized methods that know how to
read that one thing. If at any time you can't read a legal value, you have an
error.
Let's dig into the specialized parsers a bit more to see how this works:
  # ...
  def parse_object
    if @input.scan(/\{\s*/)
      object = Hash.new
      more_pairs = false
      while key = parse_string
        @input.scan(/\s*:\s*/) or error("Expecting object separator")
        object[key.value] = parse_value.value
        more_pairs = @input.scan(/\s*,\s*/) or break
      end
      error("Missing object pair") if more_pairs
      @input.scan(/\s*\}/) or error("Unclosed object")
      AST.new(object)
    else
      false
    end
  end
  # ...
# ...
This code reads JSON objects. It's pretty linear, so let's digest it in order.
First, we have to have an opening brace or we don't have an object at all. We
can see here that I try a regular expression on the StringScanner to see if
that is indeed what's up next. If it is, scan() returns the matched text (a
true value) and @input advances past that brace for our future matches. If it
returns nil, we can't read an object and the whole attempt is a bust.
When we know we're inside an object, we create the Ruby equivalent (a Hash),
fill it with all of the string/value pairs we can read, then make sure we find a
closing brace. Reading the pairs is the trickiest part because we have to match
a string, followed by a colon, and finally a value. Then, if we find a comma,
we know another pair is expected. If not, we've read the whole object. Note
that I verify these assumptions at every step and toss error()s if any of them
fail. For parsing strings and values, we just reuse the parse_string() method
we first saw called in parse_value() and parse_value() itself.
You can see that I'm constantly trimming space around the JSON syntax. I could
have also done that with repeated calls to the trim_space() helper we saw used
in parse_value(), but that fattens up the code a bit and slows things down with
more tests. For these simple cases, I opted for the shortcut.
Having deciphered parse_object(), parse_array() is trivial:
  # ...
  def parse_array
    if @input.scan(/\[\s*/)
      array = Array.new
      more_values = false
      while contents = parse_value rescue nil
        array << contents.value
        more_values = @input.scan(/\s*,\s*/) or break
      end
      error("Missing value") if more_values
      @input.scan(/\s*\]/) or error("Unclosed array")
      AST.new(array)
    else
      false
    end
  end
  # ...
# ...
This is identical to the process we just examined save that it's pulling
individual values in the middle instead of string/value pairs. This simplifies
the code a bit, as you can see. We also throw these objects into a Ruby Array
instead of a Hash.
Now we are ready for parse_string() and it has a couple of helpers:
  # ...
  def parse_string
    if @input.scan(/"/)
      string = String.new
      while contents = parse_string_content || parse_string_escape
        string << contents.value
      end
      @input.scan(/"/) or error("Unclosed string")
      AST.new(string)
    else
      false
    end
  end

  def parse_string_content
    @input.scan(/[^\\"]+/) and AST.new(@input.matched)
  end

  def parse_string_escape
    if @input.scan(%r{\\["\\/]})
      AST.new(@input.matched[-1])
    elsif @input.scan(/\\[bfnrt]/)
      AST.new(eval(%Q{"#{@input.matched}"}))
    elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
      AST.new([Integer("0x#{@input.matched[2..-1]}")].pack("U"))
    else
      false
    end
  end
  # ...
# ...
Whenever a structure you need to read gets more complicated, you want to break
it down into smaller parsers that read more specialized pieces. Some probably
would have broken down the string/value pairs from parse_object() (into a
parse_object_pair()), but you don't gain much from that and it was just simple
enough that I opted for the easier approach. Here though we need to handle
content and escapes differently, and there may be any combination of them in
any order inside the string. Now the gain is worth it.
Content is easy enough to handle, since we can pass it through unaltered. It's
already content in a Ruby String object. Escapes we have to work on a little
more. Some we just unescape, but others need to be converted. I used pack() to
handle Unicode, but you can see that I was lazy and used eval() on the special
string escapes. All of these have the same meaning in Ruby and thanks to the
match I know it's safe to eval() the contents without worrying about embedded
Ruby nastiness.
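Here's what those two conversion paths do with concrete escapes (illustrative
only):

# Unicode escape: take the four hex digits and pack the codepoint as UTF-8.
matched = "\\u0061"
[Integer("0x#{matched[2..-1]}")].pack("U")  # => "a"

# Special string escape: the text is also valid Ruby, so eval() unescapes it.
eval(%Q{"\\n"})                             # => "\n" (a real newline)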
With those defined, parse_string() is similar to parse_array(). Find the start
of a JSON string, pull content and escapes as long as we can, then find the end
of the string.
The last two parsers are the easiest of all:
  # ...
  def parse_number
    @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\b/) and
      AST.new(eval(@input.matched))
  end

  def parse_keyword
    @input.scan(/\b(?:true|false|null)\b/) and
      AST.new(eval(@input.matched.sub("null", "nil")))
  end
  # ...
These are just match and eval() as you can plainly see. Again the eval()s are
safe because the match ensures we aren't facing any dangerous content.
Some feel that using regular expressions like this isn't true parsing. We
certainly could have chopped the number rule down into a bunch of smaller rules.
However, the number definition is squarely in the domain of what regular
expressions do well and I'm more of a practical kind of guy. I have access to
regular expressions with this setup, the needed expression isn't really all that
complex, and I like easy jobs. Thus I use them.
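To see what that one expression accepts and rejects, a quick check against the
quiz's test data (NUMBER is just a local name for the same pattern):

NUMBER = /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/

%w[42 -13 3.1415 -0.01 0.2e+1].each { |s| p s[/\A#{NUMBER}\z/] }
# => "42", "-13", "3.1415", "-0.01", "0.2e+1"  (all legal JSON numbers)
%w[007 1_000 1K].each { |s| p s[/\A#{NUMBER}\z/] }
# => nil, nil, nil  (leading zeros and stray characters don't match)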
All that is left are the two helpers I used, though the implementations won't be
any surprise:
  # ...
  def trim_space
    @input.scan(/\s+/)
  end

  def error(message)
    if @input.eos?
      raise "Unexpected end of input."
    else
      raise "#{message}: #{@input.peek(@input.string.length)}"
    end
  end
end
First, trim_space() just tries a match to advance us past any whitespace. It
may fail, because there wasn't any whitespace to skip, but that doesn't affect
us any. Either way, we know that we aren't about to read whitespace after it
is called.
My error() wrapper just raise()s exceptions. It adds the content left to parse
so you can see where you had trouble or replaces the message altogether to warn
you that all of the content was exhausted.
That's all it takes to build a simple JSON parser. Take some time to look
through the other hand-rolled solutions now and I bet you'll be surprised by
how similarly they work. Then you can look into grammars and how they simplify
the process of defining new parsers.
The final Ruby Quiz will take us back into the world of finance...
On Feb 7, 2008 6:30 AM, tho_mica_l <mica...@gmail.com> wrote:
> Hi,
>
> > - 191529 2 1 445 Thomas Link (RE lexer, ruby eval)
> > 359665 279539 3 0 405 Thomas & Paolo (RE + eval)
>
> May I ask which tests are missing here, since it passes all my tests?
>
I'm running all tests from this thread. Here are the ones that the above
solutions failed:
assert_raise(RuntimeError) { @parser.parse(%Q{"a" "b"}) }
assert_equal( "a", @parser.parse(%Q{"\\u#{"%04X" % 97}"}) ) # 97 was ?a, 1.9 compat.
assert_equal('#{p 123}', @parser.parse(%q{"\\u0023{p 123}"}))
Maybe the versions of these I have are old.
>
> > 2079190 - 0 0 653 Eric Mahurin (Grammar, unreleased, ruby2cext)
>
> This made me quite curious. Unfortunately it seems ruby2cext doesn't
> support ruby19 yet. Anyway, will this coming-up version of grammar be
> a drop-in replacement/update for grammar 0.5, i.e. is it safe to use
> grammar 0.5 now?
I'll have a compatibility layer that will use the new stuff. You'll still
get all of the performance benefits, but this will ease any transition.
Most of the API changes will be minor. It will be a cleaner API.
On Feb 7, 2008 8:08 AM, Ruby Quiz <ja...@grayproductions.net> wrote:
> We saw a large variety of solutions for this week's problem. Many of them
> used a parser generator to construct their parser. You do that by defining
> a grammar that describes the syntax you need to read. The parser generator
> then translates your grammar into parsing code that will match the described
> syntax.
> Of the generators used, Treetop was definitely the most popular and is
> surely worth a look if you want to do some grammar based parsing.
But not nearly the fastest... My very old Grammar class already generates
parsers 5-10X faster than treetop and my next release will give another
5-10X plus using ruby2cext will give another 5-10X (generates code that
optimizes well with ruby2cext). Only a pure-C parser would beat it. I've
had this stuff sitting around locally for too long and need to release.
.. i'll get off my (biased) soapbox now ...
<snip>
> Eric Mahurin sent in a hand-rolled solution that has quite a few advantages.
> First, it is trivial to adapt so that the entire content to be parsed
> doesn't need to be read into memory. It's also some very efficient code.
Computationally, I believe this is the simplest (most efficient) type -
LL(1). It parses "L"eft-to-right and derives the "L"eftmost part of a
grammar using (1) token/character. When the token is a character, it becomes
very easy to parse from a file (you just need a variable to hold the next
character in the file, previously read with IO.getc).
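A minimal sketch of that one-character buffer (hypothetical names, not Eric's
actual code):

# Wrap an IO so the parser always has exactly one character of lookahead.
class CharStream
  def initialize(io)
    @io = io
    @lookahead = @io.getc   # prime the buffer with the first character
  end

  def peek                  # inspect the next character without consuming it
    @lookahead
  end

  def next!                 # consume the lookahead and refill it from the file
    ch = @lookahead
    @lookahead = @io.getc
    ch
  end
end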
> Unfortunately, that makes it a touch less obvious if you aren't already
> familiar with parsing techniques.
Yep. When rolling by hand, Regexp makes it a bit easier to do recursive
descent parsers since you can easily do more than just a character at a
time. If no Regexp has unlimited match length, you'll have an LL(k) parser
where k is that max match length. You could adapt reading from a file by
keeping a string buffer of the next (k) characters in the file. If you have
a Regexp that can have unlimited match length, I think you might call that
an LL(*) parser. You'll have more problems parsing from a file in this
case, unless possibly you can keep all matches contained in a line
(IO.gets would act as a mini-parser looking for newline).
Look here if you want to see more about LL parsers:
http://en.wikipedia.org/wiki/LL_parser
> The AST definition also merits a bit of discussion. It would have been
> nice to just have each method return the Ruby objects for the JSON it
> parsed. However, false and nil (null in JSON) are legal JSON parses in
> some places. If I return those, I won't be able to tell if my parse
> succeeded or failed. To get around that, all parsed JSON values are
> wrapped in a trivial abstract syntax tree object. Returning this object
> is always true and, after I've verified that the parse worked, it's just
> one more method call to retrieve the actual value it parsed.
The approach I like to take is to have the caller pass the AST to be modified.
I just use a sequence (Array or even String) to represent the sequence of
branches at that level. Then each grammar method can put none to an
arbitrary number of branches in the AST. There is much more flexibility.
The grammar method just returns true or false for whether it matches
independent of the AST. Some parsers don't even need an AST (or a sparse
one) since things might be done on the fly as parsing occurs. Passing the
AST separately works in all these cases.
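For instance, a parser method in that style might look like this (a sketch of
the idea, assuming an @input StringScanner as in the solutions above; not
Eric's actual code):

# The caller owns the AST; the method appends branches and reports success.
def parse_keyword(ast)
  if @input.scan(/true\b/)
    ast << true
    true
  elsif @input.scan(/false\b/)
    ast << false       # false can be appended without looking like a failure
    true
  elsif @input.scan(/null\b/)
    ast << nil
    true
  else
    false              # no match: the AST is left untouched
  end
end

branches = []
parse_keyword(branches) or raise "Expected keyword"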
> On Feb 7, 2008 8:08 AM, Ruby Quiz <ja...@grayproductions.net> wrote:
>
>> We saw a large variety of solutions for this week's problem. Many of them
>> used a parser generator to construct their parser. You do that by defining
>> a grammar that describes the syntax you need to read. The parser generator
>> then translates your grammar into parsing code that will match the
>> described syntax.
>> Of the generators used, Treetop was definitely the most popular and is
>> surely worth a look if you want to do some grammar based parsing.
>
> But not nearly the fastest...
That's true, but I think the need for raw speed in parsers is seldom
the top priority. Large XML can often cause speed issues, but outside
of that I don't personally need many super fast parsers. Maybe I just
don't run into those problems a lot though.
James Edward Gray II
I was frankly amazed how much of the discussion was about speed.
Besides, there's an awful lot of useful languages that you can't
do with LL(1). Including Ruby and C...
That said, Treetop is very slow, and we need to improve that.
Fortunately, it's not very hard to.
On Feb 8, 2008 5:59 PM, Clifford Heath <n...@spam.please.net> wrote:
> James Gray wrote:
> > On Feb 8, 2008, at 12:27 PM, Eric Mahurin wrote:
> >>> Treetop was definitely the most popular
> >> But not nearly the fastest...
> > That's true, but I think the need for raw speed in parsers is seldom
> > the top priority.
>
> I was frankly amazed how much of the discussion was about speed.
> Besides, there's an awful lot of useful languages that you can't
> do with LL(1). Including Ruby and C...
There are techniques to convert LL(k) and LR(k) grammars to LL(1) and LR(1)
(factoring, left-recursion removal), but they can make the grammar grow large
and can make the actions difficult. For example, left-recursion removal
rewrites expr : expr '+' term | term as expr : term ('+' term)*, trading the
left-recursive rule for a repetition. C shouldn't be an issue. Ruby is
definitely a beast to parse. Starting with the YACC spec, (almost) any
LL/recursive-descent/packrat parser will have a hard time dealing with the
left-recursion. Not to mention all of the lexer states. Starting from
scratch is probably a better choice for many LL/recursive-descent/packrat
parsers in handling LR grammars (with left-recursion). That being said, I
have a way of handling left-recursion directly in my development LL(1/*)
parser generator. I haven't seen any other LL parsers do it.
That's an understatement! They must magnify the number of rules
by something like M^(k-1), where M is the number of terminals...
or something like that. It's not reasonable to consider, and still
doesn't do LL(*) like a packrat does.
C labels require 2 levels of lookahead.
> James Gray wrote:
>> On Feb 8, 2008, at 12:27 PM, Eric Mahurin wrote:
>>>> Treetop was definitely the most popular
>>> But not nearly the fastest...
>> That's true, but I think the need for raw speed in parsers is
>> seldom the top priority.
>
> I was frankly amazed how much of the discussion was about speed.
Don't be. It's a common pastime for little challenges like this.
It's the easiest metric to quantify, so it makes for fun
comparisons. :)
James Edward Gray II
I personally found it quite interesting to see how well a hand-crafted
parser could perform. I initially assumed the hackish RE-based
solution would be fastest.
Another aspect would of course be how long it takes to create such a
parser in comparison to the other solutions. Unfortunately, we don't
have timings for that.
> That said, Treetop is very slow, and we need to improve that.
My main concern with treetop isn't so much speed but rather the
polyglot approach. While the idea per se is pretty cool, it seems to
preclude programmatic generation/extension of a grammar definition.
Eg store the rules as an array of lambdas, programmatically add a new
rule on demand, recreate the parser etc. If I understand it right,
this isn't easily possible with treetop. I scanned the source code a
little bit and it seems treetop requires the parser definition to be
saved on disk. I assume you're now going to tell me I'm wrong? Please
do so. I'd be glad to hear it's otherwise.
Regards,
Thomas.
Yes, it was interesting. I just thought that a few other things
were interesting too ;-)
> My main concern with treetop isn't so much speed but rather the
> polyglot approach.
Well, I'm the author of the Polyglot gem, and that doesn't preclude
the idea of executing Ruby blocks during the parse. I'd be much in
favor of that actually, but it's done Nathan's way as he created
Treetop. I like the way ANTLR and grammar let you build the parse
tree you want. Or not. But remember, packrat needs parse tree nodes
that contain enough to enable the backtracking.
Since Treetop has the ability to request a custom node (using the
<my_syntax_node_class> annotation) perhaps you can do what you want
by adding code to the constructor in that class? Just remember though,
the node might get discarded during backtracking, so don't do anything
too permanent :-).
> Eg store the rules as an array of lambdas, programmatically add a new
> rule on demand, recreate the parser etc. If I understand it right,
> this isn't easily possible with treetop.
I think it would be a sin against the gods of good taste to do that,
especially when the thing that did it might get discarded!
> I scanned the source code a
> little bit and it seems treetop requires the parser definition to be
> saved on disk. I assume you're now going to tell me I'm wrong? Please
> do so. I'd be glad to here it's otherwise.
It's otherwise. The compiler uses a Builder pattern, and then either
saves that or eval's it. When Polyglot requires a grammar, it first
passes through the normal (or rubygems+normal) require, which will
load a .rb file if one exists in the path. Otherwise it compiles
the grammar file and eval's it.
I think, without testing it, that multiple require's in the 2nd
situation will compile the grammar multiple times, which would be
a bug... I'll test for that tomorrow... bed now ;-).
Clifford Heath.
On Feb 9, 2008 12:59 AM, ThoML <mica...@gmail.com> wrote:
> > I was frankly amazed how much of the discussion was about speed.
>
> I personally found it quite interesting to see how well a hand-crafted
> parser could perform. I initially assumed the hackish RE-based
> solution would be fastest.
>
I would have thought that too. I knew my LL(1) hand solution would be near
the top but was surprised that it came out on top (at least on my
machine/benchmark) especially considering the RE+eval (especially yours)
solutions. I think object creation was a big factor. My LL(1) parser
created basically nothing but the AST. The match objects likely slowed down
the RE solutions (StringScanner was probably best off since the match object
is just a string). When you turn off GC, some of the RE solutions beat
mine.
> Another aspect would of course be how long it takes to create such a
> parser in comparison to the other solutions. Unfortunately, we don't
> have timings for that.
I'm guessing that you could crank out an initial RE solution the fastest.
But, those solutions were also the ones that failed as more tests were
added, since they were hacky. The solutions that took no shortcuts and
followed the spec to a T from the beginning may have taken longer initially,
but they had fewer issues.
>
> > That said, Treetop is very slow, and we need to improve that.
>
> My main concern with treetop isn't so much speed but rather that the
> polygot approach. While the idea per se is pretty cool, it seems to
> preclude a programmatic generation/extension of a grammar definition.
> Eg store the rules as an array of lambdas, programmatically add a new
> rule on demand, recreate the parser etc.
My Grammar classes definitely give you that ability. One of the many powers
of using Ruby as the DSL.
For example, if you deal with a "," separated list terminated by ";" for a
variety of item types, you could do something like this:
comma_list = lambda { |item| item.list1(E(?;), E(?,)) }
then use it later like this:
string_list = comma_list[string] # string defined earlier
number_list = comma_list[number] # number defined earlier
One of the powers is creating your own generic grammar constructs on the
fly.
I was reading about PEG parsers on the wikipedia page [1], and wandered
along to a site about packrat parsers [2].
One article there [3] seems to imply that while a packrat parser's
memoization gives linear theoretical time, for almost all real grammars
(and inputs, I presume) it's actually not necessary.
They suggest that for the cases where there is an exponential time
problem, it's sensible to rely on user supplied memoization directives
in the grammar, or to add memoization selectively and automatically,
based on profiling information.
I'm not sure if that's any help, or if you've already come across these
articles. Hopefully they're of some interest though.
Cheers,
Benjohn
[1] http://en.wikipedia.org/wiki/Parsing_expression_grammar
[2] http://pdos.csail.mit.edu/~baford/packrat/
[3] http://www.mercury.csse.unimelb.edu.au/information/papers/packrat.pdf
There's a bit of a chicken/egg problem here.
Existing computer languages have been created to be easy to parse,
which means they require minimal lookahead: one or two tokens.
In part, that's because such languages are easier for humans to
parse as well, and in part because parser generators couldn't
produce efficient parsers for them... until now. Now that packrat
is on the scene, we're free to create more natural looking grammars.
> They suggest that for the cases where there is an exponential time
> problem, it's sensible to rely on user supplied memoization directives
> in the grammar,
That requires a lot more user understanding (of where to memoize),
or a whole lot more analysis. ANTLR takes the 2nd approach, but it's
a difficult task that still causes many surprises. Treetop is slow,
but it's easily accessible to Joe Average without special skills and
without major surprises. I think it fills an important niche.
> or to add memoization selectively and automatically,
> based on profiling information.
That would be cool. Write an exponential-time parser, run it over
your test cases in learning mode, and then regenerate it to get a
fast parser.
Clifford Heath.
That's an interesting suggestion - that languages are as they are to a
great extent because of the difficulty of implementing their parsers, so
existing languages don't need memoized parsing. Certainly a compelling
point. Do you see newer parsing techniques opening up language design then?
I bumped into a few suggestions of pluggable language syntax while
reading about PEG, and I definitely like that idea.
Cheers,
B