The data tends to be structured, but not rigorously, and the format
changes whenever someone feels like it. It's not hard to parse manually,
but wouldn't it be nice to do a little metaprogramming at the top of a
class and say something like this? (Not a rigorous example.)
class LoopDetector
  one :header, :hash, :start_after=>/^\*+$/,
      :end_before=>/^\*+$/, :split=>/:\s+/
  many :days, LoopData, :start_after=>:header,
       :end_before=>/\n\n\n/
end
Most of the data can be broken down into:
- Spacer lines
- Hashes
- Tables
- Garbage (no, seriously: a lot of these files contain completely
pointless information)
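One way such class-level declarations could be wired up is sketched below. It only records what each section looks like; the parser that would consume those records is left out. The module name SectionDSL, the Section struct, and the sections accessor are assumptions made for illustration and do not come from an existing library.

```ruby
# Hypothetical sketch: SectionDSL, Section and `sections` are illustrative
# names, not part of any existing library.
module SectionDSL
  Section = Struct.new(:kind, :name, :type, :options)

  # Each macro records a declaration; a parser would read them later.
  def one(name, type, options = {})
    sections << Section.new(:one, name, type, options)
  end

  def many(name, type, options = {})
    sections << Section.new(:many, name, type, options)
  end

  # Per-class list of recorded declarations, in declaration order.
  def sections
    @sections ||= []
  end
end

class LoopData; end

class LoopDetector
  extend SectionDSL

  one  :header, :hash, :start_after => /^\*+$/,
                       :end_before  => /^\*+$/, :split => /:\s+/
  many :days, LoopData, :start_after => :header,
                        :end_before  => /\n\n\n/
end

LoopDetector.sections.each do |s|
  puts "#{s.kind} #{s.name} (#{s.type}): #{s.options.keys.join(', ')}"
end
```

Because the macros are plain methods added with extend, the class body stays ordinary Ruby, and the recorded list can be inspected or tested without any input file.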
Any ideas folks?
.adam sanderson
Here's one example of the type of data I get to play with (in reality
it goes from 00:00 -> 23:55 for each set of Loop Data, and there are
about 200 sets of Raw Loop Data). For anyone who's interested this is
loop detector data, which measures the amount of traffic on freeways.
***********************************
Filename: 0076ON04.cdl
Extracted by: CDR_Auto version 3.31 BETA g
Creation Date: Mar27/05 (Sun)
Creation Time: 20:23:09
File Type: TEXT
***********************************
ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
01/01/04 (Thu)
---Raw Loop Data Listing---
Time Vol Occ Flg nPds
00:00 5 0.4% 1 15
00:05 11 1.2% 1 15
00:10 14 1.2% 1 15
23:50 3 0.5% 2 15
23:55 3 0.4% 1 15
ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
01/02/04 (Fri)
---Raw Loop Data Listing---
Time Vol Occ Flg nPds
00:00 0 0.0% 0 0
00:05 0 0.0% 0 0
00:10 0 0.0% 0 0
00:15 0 0.0% 0 0
23:50 0 0.0% 0 0
23:55 26 3.8% 2 10
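For comparison, here is a minimal sketch of parsing a file like this by hand, which is the baseline a generic library would have to beat. The sample is abbreviated from the listing above, and the hash keys (:date, :rows, :vol and so on) are names chosen for illustration. It assumes columns are separated by whitespace, which may not hold for every file.

```ruby
SAMPLE = <<'END_SAMPLE'
***********************************
Filename: 0076ON04.cdl
Creation Date: Mar27/05 (Sun)
***********************************
ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
01/01/04 (Thu)
---Raw Loop Data Listing---
Time Vol Occ Flg nPds
00:00 5 0.4% 1 15
00:05 11 1.2% 1 15
ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
01/02/04 (Fri)
---Raw Loop Data Listing---
Time Vol Occ Flg nPds
00:00 0 0.0% 0 0
23:55 26 3.8% 2 10
END_SAMPLE

header = {}
days = []
in_header = false

SAMPLE.each_line do |line|
  line = line.chomp
  case line
  when /\A\*+\z/
    # A spacer line opens or closes the header block.
    in_header = !in_header
  when /\A(\d\d\/\d\d\/\d\d)\s+\(\w+\)\z/
    # A date line starts a new day of readings.
    days << { :date => $1, :rows => [] }
  when /\A(\d\d:\d\d)\s+(\d+)\s+([\d.]+)%\s+(\d+)\s+(\d+)\z/
    days.last[:rows] << { :time => $1, :vol => $2.to_i, :occ => $3.to_f,
                          :flag => $4.to_i, :periods => $5.to_i }
  else
    if in_header
      key, value = line.split(/:\s+/, 2)
      header[key] = value
    end
    # Station lines and column titles are ignored in this sketch.
  end
end

puts header["Filename"]
days.each { |day| puts "#{day[:date]}: #{day[:rows].size} rows" }
```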
Whatever happened to Rockit?! Robert Feldt, where are you?
Dan
Once upon a time, in a job far, far away, I wrote an ETL (Extract, Transform,
Load) system in Perl. It had a lot of bells and whistles, but the core of it
was an XML language that described data sources, transformations, and data
destinations. The code was "compiled" to Perl to execute it.
It was neat. Ever since I started using Ruby, I've thought that it could be
done better with Ruby.
For starters, the whole notion of an XML language that is parsed and turned
into executable code could be replaced by a domain specific language. It'd
be a beautiful thing, and I bet it could be done in a LOT fewer lines than
when I did it in Perl.
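As a rough illustration of that idea, the sketch below describes a source, some transformations, and a destination in plain Ruby blocks instead of XML. Pipeline and its source, transform, and destination methods are hypothetical names invented here; they do not describe the Perl system or any existing Ruby tool.

```ruby
# Hypothetical sketch: Pipeline and its methods are illustrative names only.
class Pipeline
  def self.define(&block)
    pipeline = new
    pipeline.instance_eval(&block)   # the block plays the role of the XML file
    pipeline
  end

  def initialize
    @rows = []
    @transforms = []
    @sink = lambda { |record| record }
  end

  def source(rows)
    @rows = rows
  end

  def transform(&block)
    @transforms << block
  end

  def destination(&block)
    @sink = block
  end

  # Push every source row through each transform in order, then hand the
  # result to the destination block.
  def run
    @rows.each do |row|
      result = @transforms.inject(row) { |value, step| step.call(value) }
      @sink.call(result)
    end
  end
end

loaded = []
job = Pipeline.define do
  source(["00:00 5", "00:05 11"])
  transform { |line| line.split }
  transform { |fields| { :time => fields[0], :vol => fields[1].to_i } }
  destination { |record| loaded << record }
end
job.run
p loaded
```

Nothing is compiled here: the description is already executable Ruby, which is where the savings in line count would come from.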
Kirk Haines
> I was wondering if anyone would be interested in, or knows of a
> generic parsing library.
I've just recently been throwing together my own tool for this. I
just got done using it in a real-world (paid) project. It's small
and really just a chainsaw tool for data mining, but it seems to be a
good start. I haven't documented it yet, but here are a couple of
examples from my unit tests:
def test_complex
  path = File.join(File.dirname(__FILE__), "ross_report.txt")
  test = self
  input(path) do
    @state = :skip
    start_skipping_at("\f")
    stop_skipping_at(/\A-[- ]+-\Z/)
    skip(/\A\s*\Z/)
    skip(/--\Z/)
    find_in_skipped(/((?:Period|Week)\s+\d.+?)\s*\Z/) do |period|
      test.assert_equal("Period 02/2002", period)
    end
    stop_at("*** Selection Criteria ***")
    read do |line|
      test.assert_match(/\A\s+(?:Sales|Cust|SA)|\A[-\w]+\s+/, line)
    end
  end

  path = File.join(File.dirname(__FILE__), "car_ads.txt")
  data = input(path, "") do
    @state = :skip
    stop_skipping_at("Save Ad")
    skip(/\A\s*\Z/)
    pre { @price = @miles = nil }
    read(/\$([\d,]+\d)/) { |price| @price = price.delete(",").to_i }
    read(/([\d,]*\d)\s*m/) { |miles| @miles = miles.delete(",").to_i }
    read do |ad|
      if @price and @price < 20_000 and @miles and @miles < 40_000
        (@ads ||= Array.new) << ad.strip
      end
    end
  end
  assert_equal([<<END_AD.strip], data.ads)
2003 Chrysler Town & Country LX
$16,990, green, 21,488 mi, air, pw, power locks, ps, power mirrors,
dual air bags, keyless entry, intermittent wipers, rear defroster, alloy,
pb, abs, cruise, am/fm stereo, CD, cassette, tinted glass
VIN:2C4GP44363R153238, Stock No:C153238, CALL DAN PERKINS AT 1-800-432-6326
END_AD
end
__END__
The first half of that is parsing the report from Ruby Quiz #17
(http://www.rubyquiz.com/quiz17.html). The second half is parsing a
listing of car ads (very unstructured data) looking for cars below a
certain price and mileage.
If people think this looks promising, I'll be happy to make it
available.
James Edward Gray II
> I was wondering if anyone would be interested in, or knows of a
> generic parsing library.
http://i.loveruby.net/en/prog/racc.html might be of use.
~ Patrick
> http://i.loveruby.net/en/prog/racc.html might be of use.
Any good tutorials on Racc hiding on some corner of the net? I would
like to learn more about it.
James Edward Gray II
Then again, I might just need to play with these parsers a bit and see.
.adam sanderson
Hopefully I'll do another release in a couple of weeks (under
the rubyforge project grammar).
Feature-wise, I think what I'm doing is closest to antlr (for
C++/C#/Java/Python). But, I believe it will be more powerful
and easier to use since you write the grammar in the target
programming language (Ruby) and there is no need for a
code-generation phase.
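To make the "grammar written directly in Ruby" idea concrete, here is a minimal combinator sketch. It is not the API of the grammar project; lit, seq, repeat, and accepts? are hypothetical helpers invented for this example. Each rule is a lambda that takes the text and a position and returns the position after the match, or nil on failure.

```ruby
# Hypothetical sketch, not the API of the grammar project.

# Match a regular expression exactly at the current position.
def lit(pattern)
  lambda do |text, pos|
    m = /\G(?:#{pattern.source})/.match(text, pos)
    m && m.end(0)
  end
end

# Match each rule in turn; fail as soon as one of them fails.
def seq(*rules)
  lambda do |text, pos|
    rules.inject(pos) { |at, rule| at && rule.call(text, at) }
  end
end

# Match a rule zero or more times, stopping when it fails or stalls.
def repeat(rule)
  lambda do |text, pos|
    while (nxt = rule.call(text, pos)) && nxt > pos
      pos = nxt
    end
    pos
  end
end

# The grammar itself is ordinary Ruby: sum := number ('+' number)*
number = lit(/\d+/)
plus   = lit(/\s*\+\s*/)
sum    = seq(number, repeat(seq(plus, number)))

def accepts?(rule, text)
  rule.call(text, 0) == text.length
end

puts accepts?(sum, "1 + 22 + 333")
puts accepts?(sum, "1 + + 2")
```

Since the rules are just objects, there is no separate generation step: building the grammar and running it happen in the same program.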
Interesting idea.
There were also some posts back in April from Eric Mahurin about his
Syntax project (since renamed to 'grammar') which allows one to define a
BNF-like grammar directly in Ruby.
Phil
Sweden, and he was very busy the last time I talked to
him.
No, the last five times. ;)
Hal
Actually I have written something similar to what you describe, though
it is token based. It may be adaptable to what you describe. Certainly
it could use some tweaking, more testing and any improvements you might
offer. Here's an example of parsing something like XML.
require 'yaml'
s = %Q{
[p]
This is plain paragraph.
[t][b]This bold.[b.]This tee'd off.[t.]&tm;
[p.]
}
tokens = []
t = TokenParser::Token.new( :ONE )
t.start = lambda { |match| %r{ \[ (.*?) \] }mx }
t.stop = lambda { |match| %r{ \[ [ ]* (#{resc(match[1])}) (.*?) \. \] }mx }
tokens << t
t = TokenParser::UnitToken.new( :TWO )
t.start = lambda { |match| ; %r{ \& (.*?) \; }x }
tokens << t
cp = TokenParser.new( *tokens )
d = cp.parse( s )
y d
outputs (don't let this scare you, it's easy to traverse the content)
--- &id004 !ruby/array:TokenParser::Main
- "
"
- &id002 !ruby/object:TokenParser::Marker
content:
- >
This is plain paragraph.
- &id001 !ruby/object:TokenParser::Marker
content:
- !ruby/object:TokenParser::Marker
content:
- This bold.
inner_range: !ruby/range '36...46'
match: !ruby/object:MatchData {}
outer_range: !ruby/range '33...50'
parent: *id001
token: &id003 !ruby/object:TokenParser::Token
key: :ONE
parser:
start: !ruby/object:Proc {}
stop: !ruby/object:Proc {}
- "This tee'd off."
inner_range: !ruby/range '33...65'
match: !ruby/object:MatchData {}
outer_range: !ruby/range '30...69'
parent: *id002
token: *id003
- !ruby/object:TokenParser::Marker
content: []
match: !ruby/object:MatchData {}
outer_range: !ruby/range '69...73'
parent: *id002
token: !ruby/object:TokenParser::UnitToken
key: :TWO
parser:
start: !ruby/object:Proc {}
inner_range: !ruby/range '4...74'
match: !ruby/object:MatchData {}
outer_range: !ruby/range '1...78'
parent: *id004
token: *id003
Let me know if you'd like a copy to play with.
T.
I wrote TagTreeScanner, which can be used to parse text files when
the desired output is a hierarchy of nodes and text (i.e. an XML type
file).
http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html
Adam,
I've actually worked a bunch with Racc fairly recently. I don't have
the time (or expertise) to write up any tutorials any time soon, but I can
give you a brief description of my experience.
I started out by making incorrect assumptions. Racc does a great job of
generating a parser. The problem (that I misunderstood) is that at its
heart, Racc is a parser-generator and not much more. Racc generates a
parser that will accept or reject a sentence according to the grammar that
it is generated from. Where you go from there is up to you.
I used Racc to generate an Ada parser (without too much trouble). Though
I haven't put too much time into it, I've found the hard part to be
generating an Abstract Syntax Tree that worked for me. It's the AST that
will allow me to manipulate the stuff that I parsed.
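The gap between "accepts the input" and "gives me a tree I can work with" can be shown on a toy scale. The sketch below is a hand-written recursive-descent parser standing in for a generated one (it does not use Racc), and the Num and Add node types are hypothetical names chosen for this example. The parser builds nodes as it recognizes input, and a separate function walks the resulting tree.

```ruby
require 'strscan'

# Hypothetical node types for illustration.
Num = Struct.new(:value)
Add = Struct.new(:left, :right)

# expr := number ('+' number)*
def parse_expr(text)
  scanner = StringScanner.new(text)
  tree = parse_number(scanner)
  while scanner.scan(/\s*\+\s*/)
    # Build a left-leaning tree: (1 + 2) + 39
    tree = Add.new(tree, parse_number(scanner))
  end
  raise ArgumentError, "unexpected input at #{scanner.pos}" unless scanner.eos?
  tree
end

def parse_number(scanner)
  digits = scanner.scan(/\d+/)
  raise ArgumentError, "number expected at #{scanner.pos}" unless digits
  Num.new(digits.to_i)
end

# Walking the tree is separate from parsing, which is the point of an AST.
def evaluate(node)
  case node
  when Num then node.value
  when Add then evaluate(node.left) + evaluate(node.right)
  end
end

tree = parse_expr("1 + 2 + 39")
p tree
puts evaluate(tree)
```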
Oh geez, I almost forgot. I got started using the Parser Generators
chapter from the Ruby Developer's Guide from Syngress. I think the
chapter may have been written by Robert Feldt (Mr. Rockit, or at least the
one that isn't Herbie Hancock) and I enjoyed it. There's Rockit stuff in
there too.
Good luck to you, and if you have any specific questions or problems feel
free to contact me personally.
mattD
The Racc distribution includes several sample applications, including a
4-function calculator. If you've ever used yacc, that's probably enough
to get you going. If not, Kernighan & Pike's _The Unix Programming
Environment_ has a nice expository treatment of yacc in Chapter 8. (Now
22 years old, K&P is still a first-rate technical book.)
Steve
>> http://i.loveruby.net/en/prog/racc.html might be of use.
>
> Any good tutorials on Racc hiding on some corner of the net? I would
> like to learn more about it.
If there is some interest, I might post a small parser
(using racc/stringscanner) which I am working on right now (the parsing part
is finished). I am a novice with racc, though. But I think it is
pretty straightforward.
Patrick
(Just a simple language, like
    set a b
    do this using this, that, other
    write 'file a', 'file.b.txt'
and the like.)
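A tokenizer for a small language like that can be sketched with StringScanner from Ruby's standard library. This is an illustration only, not Patrick's parser: the token names (:word, :string, :comma, :newline) and the tokenize method are assumptions, and the sketch emits the [type, value] pairs that a parser would then consume.

```ruby
require 'strscan'

# Hypothetical tokenizer: turns source text into [type, value] pairs.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    if scanner.skip(/[ \t]+/)
      next                                    # blanks separate tokens
    elsif scanner.scan(/\n/)
      tokens << [:newline, "\n"]              # a newline ends a statement
    elsif scanner.scan(/'([^']*)'/)
      tokens << [:string, scanner[1]]         # quoted text, quotes removed
    elsif scanner.scan(/,/)
      tokens << [:comma, ","]
    elsif scanner.scan(/[\w.]+/)
      tokens << [:word, scanner.matched]
    else
      raise ArgumentError, "unexpected character at #{scanner.pos}"
    end
  end
  tokens
end

tokens = tokenize("set a b\nwrite 'file a', 'file.b.txt'\n")
tokens.each { |type, value| puts "#{type}: #{value.inspect}" }
```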
> It would be interesting to look at one way or another.
I don't want to send out a big announcement message until I get
documentation in there, but my parsing library is on RubyForge now:
http://rubyforge.org/projects/input/
It should be easy to figure out how to use it from the unit tests in
CVS. I did release a gem, if you want to install it.
James Edward Gray II
> Great.
> This looks very much like what I was imagining, or at least some part
> of it. I think I'll play with it a little bit today or tonight and
> see what I can do.
I have ideas for more features and I will document the next release,
so hopefully it will be more approachable. I'm using it to do
real-world tasks now though, so I think it has potential.
> By the way, I never knew you could do:
> ?u or ?a
> to get the int codes for a letter, how odd.
Ruby's just full of surprises. ;)
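A note for readers on newer interpreters: the behavior described above is that of Ruby 1.8, where ?a evaluated to the integer 97. Since Ruby 1.9 the same literal is the one-character string "a". The small sketch below gets the integer code either way, assuming Ruby 1.8.7 or later, where ord is available.

```ruby
# On Ruby 1.8, ?a was the integer 97; since Ruby 1.9 it is the string "a".
# Calling ord gives the integer code on 1.8.7 and on every later version.
code = ?a.ord
puts code

# Going the other way, Integer#chr turns a code back into a character.
letter = 97.chr
puts letter
```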
James Edward Gray II