
Generic Parsing Library


Adam Sanderson

Aug 16, 2005, 5:42:04 PM
I was wondering if anyone would be interested in, or knows of a generic
parsing library. I am continually faced with reading in bizarre text
files and parsing them. They tend to have regular structures though
(at the whim of the researcher who made them). I'd like to write up some
sort of declarative code to parse these files. There's a lot of room
for reuse.

The data tends to be structured, but not rigorously, and the format changes
whenever someone feels like it. It's not hard to parse manually, but
wouldn't it be nice to do a little metaprogramming at the top of a
class and say something like this? (Not a rigorous example.)

class LoopDetector
  one  :header, :hash, :start_after => /^\*+$/,
       :end_before => /^\*+$/, :split => /:\s+/
  many :days, LoopData, :start_after => :header,
       :end_before => /\n\n\n/
end
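
One way declarations like these could work is as class-level methods that simply record a rule per section, with a separate `parse` step applying the rules. What follows is only a sketch under stated assumptions: the `SectionParser` module, its `rules`, `parse`, and `convert` names, and the reading of `:start_after` and `:end_before` as regexps matched line by line are all made up for illustration. It handles `one` with the `:hash` type only; `many` and symbol references such as `:start_after => :header` are left out.

```ruby
# Hypothetical sketch: class-level macros that record declarative parse rules.
module SectionParser
  Rule = Struct.new(:name, :type, :options)

  def rules
    @rules ||= []
  end

  # Declare a section that appears once in the file.
  def one(name, type, options = {})
    rules << Rule.new(name, type, options)
  end

  # Apply every recorded rule; returns a Hash of section name => parsed data.
  def parse(text)
    lines = text.lines.map(&:chomp)
    rules.each_with_object({}) do |rule, result|
      start = lines.index { |l| l =~ rule.options[:start_after] }
      next unless start
      body = lines[(start + 1)..-1].take_while { |l| l !~ rule.options[:end_before] }
      result[rule.name] = convert(body, rule)
    end
  end

  private

  # Turn the raw lines of a section into the declared type.
  def convert(body, rule)
    case rule.type
    when :hash
      pairs = body.map { |l| l.split(rule.options[:split], 2) }
      pairs.select { |pair| pair.size == 2 }.to_h
    else
      body
    end
  end
end

class LoopDetector
  extend SectionParser
  one :header, :hash, :start_after => /^\*+$/,
      :end_before => /^\*+$/, :split => /:\s+/
end

sample = <<~TEXT
  ***********************************
  Filename: 0076ON04.cdl
  File Type: TEXT
  ***********************************
TEXT

p LoopDetector.parse(sample)
```

Keeping the rules as plain data means the same declarations could later drive other things, such as validation or documentation of the expected file layout.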

Most of the data can be broken down into:
- Spacer lines
- Hashes
- Tables
- Garbage (no, seriously: many of these files contain completely pointless
information)

Any ideas folks?
.adam sanderson


Here's one example of the type of data I get to play with (in reality
it goes from 00:00 -> 23:55 for each set of Loop Data, and there are
about 200 sets of Raw Loop Data). For anyone who's interested this is
loop detector data, which measures the amount of traffic on freeways.

***********************************
Filename: 0076ON04.cdl
Extracted by: CDR_Auto version 3.31 BETA g
Creation Date: Mar27/05 (Sun)
Creation Time: 20:23:09
File Type: TEXT
***********************************


ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
01/01/04 (Thu)

---Raw Loop Data Listing---

Time    Vol   Occ  Flg  nPds
00:00     5  0.4%    1    15
00:05    11  1.2%    1    15
00:10    14  1.2%    1    15
23:50     3  0.5%    2    15
23:55     3  0.4%    1    15

ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
01/02/04 (Fri)

---Raw Loop Data Listing---

Time    Vol   Occ  Flg  nPds
00:00     0  0.0%    0     0
00:05     0  0.0%    0     0
00:10     0  0.0%    0     0
00:15     0  0.0%    0     0
23:50     0  0.0%    0     0
23:55    26  3.8%    2    10
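
For comparison, parsing this sample by hand might look roughly like the sketch below. It assumes that a date line such as `01/01/04 (Thu)` opens each set and that table rows always start with an `HH:MM` time; the `Reading` struct and the `parse_loop_sets` method are hypothetical names, and everything that is not a date line or a table row is treated as ignorable.

```ruby
# Hypothetical sketch: pull each day's table out of loop detector text.
Reading = Struct.new(:time, :vol, :occ, :flg, :npds)

def parse_loop_sets(text)
  sets = []
  current = nil
  text.each_line do |line|
    line = line.strip
    case line
    when %r{\A(\d\d/\d\d/\d\d) \(\w+\)\z}
      # A date line such as "01/01/04 (Thu)" opens a new set.
      current = { :date => $1, :rows => [] }
      sets << current
    when /\A(\d\d:\d\d)\s+(\d+)\s+([\d.]+)%\s+(\d+)\s+(\d+)\z/
      # A table row: Time Vol Occ Flg nPds.
      next unless current
      current[:rows] << Reading.new($1, $2.to_i, $3.to_f, $4.to_i, $5.to_i)
    end
    # Spacer lines, titles, and column headers fall through and are ignored.
  end
  sets
end

sample = <<~DATA
  ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
  01/01/04 (Thu)

  ---Raw Loop Data Listing---

  Time Vol Occ Flg nPds
  00:00 5 0.4% 1 15
  00:05 11 1.2% 1 15
DATA

parse_loop_sets(sample).each do |set|
  puts "#{set[:date]}: #{set[:rows].size} rows"
end
# prints "01/01/04: 2 rows"
```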

Berger, Daniel

Aug 16, 2005, 6:03:03 PM
> -----Original Message-----
> From: Adam Sanderson [mailto:netg...@gmail.com]
> Sent: Tuesday, August 16, 2005 3:46 PM
> To: ruby-talk ML
> Subject: Generic Parsing Library
>
>
> I was wondering if anyone would be interested in, or knows of
> a generic parsing library. I am continually faced with
> reading in bizarre text files and parsing them. They tend to
> have regular structures though (at the whim of the researcher who
> made them). I'd like to write up some sort of declarative
> code to parse these files. There's a lot of room for reuse.

Whatever happened to Rockit?! Robert Feldt, where are you?

Dan


Kirk Haines

Aug 16, 2005, 5:58:55 PM
On Tuesday 16 August 2005 3:46 pm, Adam Sanderson wrote:
> I was wondering if anyone would be interested in, or knows of a generic
> parsing library. I am continually faced with reading in bizarre text
> files and parsing them. They tend to have regular structures though
> (at the whim of the researcher who made them). I'd like to write up some
> sort of declarative code to parse these files. There's a lot of room
> for reuse.

Once upon a time, in a job far, far away, I wrote an ETL (Extract, Transform,
Load) system in Perl. It had a lot of bells and whistles, but the core of it
was an XML language that described data sources, transformations, and data
destinations. The code was "compiled" to Perl to execute it.

It was neat. Ever since I started using Ruby, I've thought that it could be
done better with Ruby.

For starters, the whole notion of an XML language that is parsed and turned
into executable code could be replaced by a domain specific language. It'd
be a beautiful thing, and I bet it could be done in a LOT fewer lines than
when I did it in Perl.
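
To make the idea concrete, here is a minimal sketch of what such a domain-specific language might look like in Ruby. It is not the Perl system described above; the `Pipeline` class and its `source`, `transform`, `destination`, and `run` methods are invented for illustration. The description is an ordinary Ruby block evaluated with `instance_eval`, so there is no separate XML parsing or compile step.

```ruby
# Hypothetical sketch of an ETL job described in a Ruby block instead of XML.
class Pipeline
  def self.define(&block)
    pipeline = new
    pipeline.instance_eval(&block)
    pipeline
  end

  def initialize
    @transforms = []
  end

  # Where records come from: any object that responds to #each.
  def source(enumerable)
    @source = enumerable
  end

  # A transformation applied to every record, in declaration order.
  def transform(&block)
    @transforms << block
  end

  # Where records go: any object that responds to #<<.
  def destination(sink)
    @sink = sink
  end

  def run
    @source.each do |record|
      @sink << @transforms.inject(record) { |value, step| step.call(value) }
    end
    @sink
  end
end

output = []
job = Pipeline.define do
  source(["  alpha ", "beta  "])
  transform { |s| s.strip }
  transform { |s| s.upcase }
  destination output
end
job.run
p output  # ["ALPHA", "BETA"]
```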


Kirk Haines


James Edward Gray II

Aug 16, 2005, 6:24:51 PM
On Aug 16, 2005, at 4:46 PM, Adam Sanderson wrote:

> I was wondering if anyone would be interested in, or knows of a
> generic
> parsing library.

I've just recently been throwing together my own tool for this. I
just got done using it in a real-world (paid) project. It's small
and really just a chainsaw tool for data mining, but it seems to be a
good start. I haven't documented it yet, but here are a couple of
examples from my unit tests:

def test_complex
  path = File.join(File.dirname(__FILE__), "ross_report.txt")
  test = self

  input(path) do
    @state = :skip
    start_skipping_at("\f")
    stop_skipping_at(/\A-[- ]+-\Z/)
    skip(/\A\s*\Z/)
    skip(/--\Z/)

    find_in_skipped(/((?:Period|Week)\s+\d.+?)\s*\Z/) do |period|
      test.assert_equal("Period 02/2002", period)
    end

    stop_at("*** Selection Criteria ***")

    read do |line|
      test.assert_match(/\A\s+(?:Sales|Cust|SA)|\A[-\w]+\s+/, line)
    end
  end

  path = File.join(File.dirname(__FILE__), "car_ads.txt")

  data = input(path, "") do
    @state = :skip
    stop_skipping_at("Save Ad")
    skip(/\A\s*\Z/)

    pre { @price = @miles = nil }
    read(/\$([\d,]+\d)/) { |price| @price = price.delete(",").to_i }
    read(/([\d,]*\d)\s*m/) { |miles| @miles = miles.delete(",").to_i }

    read do |ad|
      if @price and @price < 20_000 and @miles and @miles < 40_000
        (@ads ||= Array.new) << ad.strip
      end
    end
  end

  assert_equal([<<END_AD.strip], data.ads)
2003 Chrysler Town & Country LX
$16,990, green, 21,488 mi, air, pw, power locks, ps, power mirrors,
dual air bags, keyless entry, intermittent wipers, rear defroster, alloy,
pb, abs, cruise, am/fm stereo, CD, cassette, tinted glass
VIN:2C4GP44363R153238, Stock No:C153238, CALL DAN PERKINS AT 1-800-432-6326
END_AD
end

__END__

The first half of that is parsing the report from Ruby Quiz #17
(http://www.rubyquiz.com/quiz17.html). The second half is parsing a
listing of car ads (very unstructured data) looking for cars below a
certain price and mileage.
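
For readers without the library, the car-ad half of that test can be approximated in plain Ruby. This is a sketch, not the library's API: `cheap_low_mileage_ads` is a hypothetical helper, and it assumes ads are separated by blank lines. It reuses the two regexps from the test above.

```ruby
# Sketch in plain Ruby (not the library's API): filter ads by price and mileage.
def cheap_low_mileage_ads(text, max_price = 20_000, max_miles = 40_000)
  # Treat blank-line-separated paragraphs as individual ads.
  text.split(/\n\s*\n/).map(&:strip).select do |ad|
    price = ad[/\$([\d,]+\d)/, 1]      # e.g. "16,990"
    miles = ad[/([\d,]*\d)\s*m/, 1]    # e.g. "21,488"
    next false unless price && miles
    price.delete(",").to_i < max_price && miles.delete(",").to_i < max_miles
  end
end

ads = <<~ADS
  2003 Chrysler Town & Country LX
  $16,990, green, 21,488 mi, air

  2004 Ford Example
  $25,000, red, 10,000 mi
ADS

puts cheap_low_mileage_ads(ads)
```

Only the first ad is printed, since the second is over the price limit.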

If people think this looks promising, I'll be happy to make it
available.

James Edward Gray II

Patrick May

Aug 16, 2005, 6:31:35 PM
Quoting Adam Sanderson <netg...@gmail.com>:

> I was wondering if anyone would be interested in, or knows of a
> generic parsing library.

http://i.loveruby.net/en/prog/racc.html might be of use.

~ Patrick


Adam Sanderson

Aug 16, 2005, 6:34:39 PM
It would be interesting to look at one way or another. It looks like
it could be useful for controlling some of the parsing. My old code
for parsing these types of files was in Java, and as with most of my
Java code, I realized that I've been trying to write in Ruby all along
;)
.adam sanderson

James Edward Gray II

Aug 16, 2005, 6:36:05 PM
On Aug 16, 2005, at 5:31 PM, Patrick May wrote:

> http://i.loveruby.net/en/prog/racc.html might be of use.

Any good tutorials on Racc hiding on some corner of the net? I would
like to learn more about it.

James Edward Gray II


Adam Sanderson

Aug 16, 2005, 7:03:35 PM
The mention of Rockit above was good too; it looks pretty
compelling. Here's the project:
http://rockit.sourceforge.net/
It looks pretty interesting. However, Rockit, Racc, and rbison seem to
be somewhat involved for writing parsers. I'm thinking of something
much simpler perhaps. These work great for very common syntaxes where
you have a very large number of documents. I'm thinking more along the
lines of a flexible library for quickly defining many syntaxes with a
limited set of documents.

Then again, I might just need to play with these parsers a bit and see.
.adam sanderson

Eric Mahurin

Aug 16, 2005, 8:05:08 PM
Here is an earlier version of what I'm working on:

http://groups-beta.google.com/group/comp.lang.ruby/browse_thread/thread/227313d6ff2ab1fa/8758532b85f2b001?q=bnf&rnum=2#8758532b85f2b001

Hopefully I'll do another release in a couple of weeks (under
the rubyforge project grammar).

Feature-wise, I think what I'm doing is closest to ANTLR (for
C++/C#/Java/Python). But I believe it will be more powerful
and easier to use, since you write the grammar in the target
programming language (Ruby) and there is no need for a
code-generation phase.
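
As a rough illustration of writing a grammar directly in Ruby with no generation step, here is a small parser-combinator sketch. It is not the API of the grammar project; `token`, `seq`, `alt`, and `many` are hypothetical helpers built on the standard `strscan` library, and each rule is just a lambda that takes a `StringScanner`.

```ruby
require 'strscan'

# Hypothetical sketch: grammar rules are plain Ruby lambdas.
# Each takes a StringScanner and returns a value, or nil on failure.
def token(regexp)
  lambda { |s| s.scan(regexp) }
end

# Match every parser in order; on failure, restore the position and return nil.
def seq(*parsers)
  lambda do |s|
    start = s.pos
    values = []
    parsers.each do |parser|
      value = parser.call(s)
      if value.nil?
        s.pos = start
        values = nil
        break
      end
      values << value
    end
    values
  end
end

# Return the result of the first parser that succeeds.
def alt(*parsers)
  lambda do |s|
    result = nil
    parsers.each do |parser|
      result = parser.call(s)
      break unless result.nil?
    end
    result
  end
end

# Match a parser zero or more times (it must consume input to terminate).
def many(parser)
  lambda do |s|
    values = []
    while (value = parser.call(s))
      values << value
    end
    values
  end
end

# list := item (',' item)*    item := number | word
number = token(/\d+/)
word   = token(/[a-z]+/)
comma  = token(/,\s*/)
item   = alt(number, word)
list   = seq(item, many(seq(comma, item)))

p list.call(StringScanner.new("1, two,3"))
# ["1", [[", ", "two"], [",", "3"]]]
```

Because the rules are ordinary objects, they can be named, passed around, and composed like any other Ruby value, which is the appeal of skipping a code-generation phase.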




Phil Tomson

Aug 16, 2005, 7:56:57 PM

[I happened to see this on the ruby-talk archives, but it hasn't shown up
in the clr newsgroup yet... so I'm posting with the same subject]

Interesting idea.

There were also some posts back in April from Eric Mahurin about his
Syntax project (since renamed to 'grammar') which allows one to define a
BNF-like grammar directly in Ruby.

Phil

Hal Fulton

Aug 16, 2005, 8:34:14 PM

> Whatever happened to Rockit?! Robert Feldt, where are you?

Sweden, and he was very busy the last time I talked to
him.

No, the last five times. ;)


Hal


Trans

Aug 16, 2005, 9:11:39 PM
Hi Adam,

Actually, I have written something similar to what you describe, though
it is token-based. It may be adaptable to your case. Certainly
it could use some tweaking, more testing, and any improvements you might
offer. Here's an example of parsing something like XML.

require 'yaml'

s = %Q{
[p]
This is plain paragraph.
[t][b]This bold.[b.]This tee'd off.[t.]&tm;
[p.]
}

tokens = []

t = TokenParser::Token.new( :ONE )
t.start = lambda { |match| %r{ \[ (.*?) \] }mx }
t.stop = lambda { |match| %r{ \[ [ ]* (#{resc(match[1])}) (.*?) \. \] }mx }
tokens << t

t = TokenParser::UnitToken.new( :TWO )
t.start = lambda { |match| ; %r{ \& (.*?) \; }x }
tokens << t

cp = TokenParser.new( *tokens )
d = cp.parse( s )
y d

Output (don't let this scare you; it's easy to traverse the content):

--- &id004 !ruby/array:TokenParser::Main
- "
"
- &id002 !ruby/object:TokenParser::Marker
content:
- >

This is plain paragraph.

- &id001 !ruby/object:TokenParser::Marker
content:
- !ruby/object:TokenParser::Marker
content:
- This bold.
inner_range: !ruby/range '36...46'
match: !ruby/object:MatchData {}
outer_range: !ruby/range '33...50'
parent: *id001
token: &id003 !ruby/object:TokenParser::Token
key: :ONE
parser:
start: !ruby/object:Proc {}
stop: !ruby/object:Proc {}
- "This tee'd off."
inner_range: !ruby/range '33...65'
match: !ruby/object:MatchData {}
outer_range: !ruby/range '30...69'
parent: *id002
token: *id003
- !ruby/object:TokenParser::Marker
content: []
match: !ruby/object:MatchData {}
outer_range: !ruby/range '69...73'
parent: *id002
token: !ruby/object:TokenParser::UnitToken
key: :TWO
parser:
start: !ruby/object:Proc {}
inner_range: !ruby/range '4...74'
match: !ruby/object:MatchData {}
outer_range: !ruby/range '1...78'
parent: *id004
token: *id003

Let me know if you'd like a copy to play with.

T.

Gavin Kistner

Aug 16, 2005, 10:09:26 PM
On Aug 16, 2005, at 3:46 PM, Adam Sanderson wrote:
> I was wondering if anyone would be interested in, or knows of a
> generic
> parsing library. I am continually faced with reading in bizarre text
> files and parsing them. They tend to have regular structures though
> (at the whim of the researcher who made them). I'd like to write up some
> sort of declarative code to parse these files. There's a lot of room
> for reuse.

I wrote TagTreeScanner, which can be used to parse text files when
the desired output is a hierarchy of nodes and text (i.e. an XML-like
file).

http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html


ogilt...@davie.textdrive.com

Aug 17, 2005, 9:00:29 AM


Adam,

I've actually worked a bunch with Racc fairly recently. I don't have
the time (or expertise) to write up any tutorials any time soon, but I can
give you a brief description of my experience.

I started out by making incorrect assumptions. Racc does a great job of
generating a parser. The problem (that I misunderstood) is that at its
heart, Racc is a parser-generator and not much more. Racc generates a
parser that will accept or reject a sentence according to the grammar that
it is generated from. Where you go from there is up to you.

I used Racc to generate an Ada parser (without too much trouble). Though
I haven't put too much time into it, I've found the hard part to be
generating an Abstract Syntax Tree that worked for me. It's the AST that
will allow me to manipulate the stuff that I parsed.
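
To show what "building an AST" can mean in practice, here is a small sketch. It uses a hand-written recursive-descent parser rather than Racc, and a toy arithmetic grammar rather than Ada; the `Num` and `BinOp` node types, `TinyParser`, and `evaluate` are all hypothetical. The point is that each grammar rule returns a node object instead of just accepting or rejecting input, which is the job a parser generator's rule actions would have to do.

```ruby
require 'strscan'

# Hypothetical AST node types.
Num   = Struct.new(:value)
BinOp = Struct.new(:op, :left, :right)

# Tiny hand-written parser for sums of products, e.g. "1 + 2 * 3".
class TinyParser
  def initialize(text)
    @s = StringScanner.new(text)
  end

  def parse
    node = expr
    raise ArgumentError, "unexpected input at #{@s.pos}" unless @s.eos?
    node
  end

  private

  # expr := term ('+' term)*
  def expr
    node = term
    node = BinOp.new(:+, node, term) while @s.scan(/\s*\+\s*/)
    node
  end

  # term := number ('*' number)*
  def term
    node = number
    node = BinOp.new(:*, node, number) while @s.scan(/\s*\*\s*/)
    node
  end

  def number
    digits = @s.scan(/\d+/) or raise ArgumentError, "expected number at #{@s.pos}"
    Num.new(digits.to_i)
  end
end

# With a tree in hand, later passes can walk it; here, a simple evaluator.
def evaluate(node)
  case node
  when Num   then node.value
  when BinOp then evaluate(node.left).send(node.op, evaluate(node.right))
  end
end

tree = TinyParser.new("1 + 2 * 3").parse
p evaluate(tree)  # 7
```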

Oh geez, I almost forgot. I got started using the Parser Generators
chapter from the Ruby Developer's Guide from Syngress. I think the
chapter may have been written by Robert Feldt (Mr. Rockit, or at least the
one that isn't Herbie Hancock) and I enjoyed it. There's Rockit stuff in
there too.

Good luck to you, and if you have any specific questions or problems feel
free to contact me personally.

mattD

Steven Jenkins

Aug 18, 2005, 9:55:20 AM
James Edward Gray II wrote:
> Any good tutorials on Racc hiding on some corner of the net? I would
> like to learn more about it.

The Racc distribution includes several sample applications, including a
4-function calculator. If you've ever used yacc, that's probably enough
to get you going. If not, Kernighan & Pike's _The Unix Programming
Environment_ has a nice expository treatment of yacc in Chapter 8. (Now
22 years old, K&P is still a first-rate technical book.)

Steve


Patrick Gundlach

Aug 18, 2005, 6:05:02 PM
Hi,

>> http://i.loveruby.net/en/prog/racc.html might be of use.
>
> Any good tutorials on Racc hiding on some corner of the net? I would
> like to learn more about it.

If there is some interest, I might post a small parser
(using racc/stringscanner) which I am working on right now (the parsing part
is finished). I am a novice with racc, though. But I think it is
pretty straightforward.

Patrick

(Just a simple language, like

set a b

do this using this, that, other
write 'file a', 'file.b.txt'

and the like)
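
A tokenizer for a language of that shape could be sketched with `StringScanner` roughly as follows. This is not the parser mentioned above; the `tokenize` method and the token names (`:WORD`, `:STRING`, `:COMMA`, `:NEWLINE`) are assumptions. It produces `[type, value]` pairs, which is the general shape a Racc-generated parser reads from its `next_token` method.

```ruby
require 'strscan'

# Hypothetical tokenizer for the small language above; token names are made up.
def tokenize(source)
  s = StringScanner.new(source)
  tokens = []
  until s.eos?
    if s.scan(/[ \t]+/)
      # Skip horizontal whitespace.
    elsif s.scan(/\n+/)
      tokens << [:NEWLINE, "\n"]
    elsif s.scan(/'([^']*)'/)
      tokens << [:STRING, s[1]]   # contents without the quotes
    elsif s.scan(/,/)
      tokens << [:COMMA, ","]
    elsif s.scan(/\w+/)
      tokens << [:WORD, s.matched]
    else
      raise ArgumentError, "unexpected character #{s.getch.inspect}"
    end
  end
  tokens
end

p tokenize("set a b\nwrite 'file a', 'file.b.txt'\n")
# [[:WORD, "set"], [:WORD, "a"], [:WORD, "b"], [:NEWLINE, "\n"],
#  [:WORD, "write"], [:STRING, "file a"], [:COMMA, ","],
#  [:STRING, "file.b.txt"], [:NEWLINE, "\n"]]
```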

James Edward Gray II

Aug 18, 2005, 9:32:08 PM
On Aug 16, 2005, at 5:36 PM, Adam Sanderson wrote:

> It would be interesting to look at one way or another.

I don't want to send out a big announcement message until I get
documentation in there, but my parsing library is on RubyForge now:

http://rubyforge.org/projects/input/

It should be easy to figure out how to use it from the unit tests in
CVS. I did release a gem, if you want to install it.

James Edward Gray II

Adam Sanderson

Aug 19, 2005, 1:39:11 PM
Great.
This looks very much like what I was imagining, or at least some part
of it. I think I'll play with it a little bit today or tonight and see
what I can do. By the way, I never knew you could do:
?u or ?a
to get the int codes for a letter, how odd.
.adam sanderson

James Edward Gray II

Aug 19, 2005, 1:47:53 PM
On Aug 19, 2005, at 12:41 PM, Adam Sanderson wrote:

> Great.
> This looks very much like what I was imagining, or at least some part
> of it. I think I'll play with it a little bit today or tonight and
> see
> what I can do.

I have ideas for more features and I will document the next release,
so hopefully it will be more approachable. I'm using it to do
real-world tasks now though, so I think it has potential.

> By the way, I never knew you could do:
> ?u or ?a
> to get the int codes for a letter, how odd.

Ruby's just full of surprises. ;)
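
A note for anyone trying this on a current Ruby: the `?a` literal returned the integer character code in Ruby 1.8, the version in use when this thread was written, but since Ruby 1.9 it evaluates to a one-character String. `String#ord` and `Integer#chr` do the conversions explicitly:

```ruby
# Ruby 1.9 and later: ?a is a one-character String, not an Integer.
p ?a        # "a"
p ?a.ord    # 97
p "u".ord   # 117
p 97.chr    # "a"
```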

James Edward Gray II

