We will also need to make sure that this architecture is agile and
quick to adapt. After all, if all it takes is the re-ordering of a
column or changing from Word to PDF in order to confuse the system, we
might find that some of our data sources will deliberately try to
sabotage our efforts.
Last night I read a few of the documents you have mentioned ("Hack,
Mash, and Peer" and "Government Data and the Invisible Hand"). At one
point, there is mention that in some data stores, many bills were
(Word or PDF) documents and some others were image files. In order to
scrape such a structure, our architecture would need to be relatively
flexible.
So while any parser that we write should certainly be as open-minded
as possible (while still providing reliable results) we should resign
ourselves to the fact that sometimes different parsers will be needed
for different parts of a data set.
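
As a rough illustration of "different parsers for different parts of a data set", here is a minimal sketch of a registry that picks a parser by file extension. The class name ParserRegistry and the two placeholder parsers are assumptions made up for this example; real parsers for Word, PDF, or image files would be far more involved.

```ruby
# Hypothetical registry mapping a document format to the parser that handles it.
class ParserRegistry
  def initialize
    @parsers = {}
  end

  # Register a callable (anything responding to #call) for a file extension.
  def register(extension, parser)
    @parsers[extension.downcase] = parser
  end

  # Pick a parser from the file name. Unknown formats raise, so a source that
  # switches format is noticed instead of being silently mis-parsed.
  def parser_for(filename)
    ext = File.extname(filename).downcase
    @parsers.fetch(ext) { raise ArgumentError, "no parser registered for #{ext.inspect}" }
  end

  def parse(filename, raw)
    parser_for(filename).call(raw)
  end
end

registry = ParserRegistry.new
# Placeholder parsers for two simple text formats (illustrative only).
registry.register(".csv", ->(raw) { raw.lines.map { |line| line.chomp.split(",") } })
registry.register(".txt", ->(raw) { raw.lines.map(&:chomp) })

rows = registry.parse("bills.CSV", "C-1,Budget\nC-2,Health\n")
```

Failing loudly on an unregistered format is a design choice of this sketch: it turns a source's format change into a visible error that someone can respond to by adding a parser.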
-Erik
Scrapers and grammar inference are loads of fun. I got started on one
for extracting the information from http://www.parl.gc.ca/ for use in
mass mailouts. Never really finished that project...
I used Ruby+Hpricot to do the scraping. Hpricot is a great little
library well suited to the task. Nothing reads better than:
require 'open-uri'
require 'hpricot'
doc = Hpricot.parse(open("http://www.foo.com"))
(doc/"table").each { |elem| process_element(elem) }
Don/
--
karfai [AT] gmail.com
http://www.strangeware.ca
http://blog.strangeware.ca
The inconsistency in table definitions and data definitions extends to
contract disclosure as well.
[Here's an example of the data available from the 4th quarter for Finance]
Finance, as you might expect, seems to apply their disclosure policies
for contracts with a view towards appropriate government accounting
policies. But last year, when I tried to track and cross-tab contracts
by vendor across all govt depts, smaller agencies and boards were
horrendously inconsistent in applying their disclosure.
--
David Akin
-------------------
http://www.davidakin.com
Agreed on ease of syntax. However, don't forget the potential of
helper apps.
There are tools that, when you click on a node in a document, will
display the XPath that leads to that node.
The XPath is overdetermined; you find useless things such as
/html[1]/body[1]/...
which can easily (and automatically) be pruned to
/html/body/...
More interesting is the possibility of pointing to an anchor:
//td[contains(lower-case(text()), 'first name')]/following-sibling::td/text()
(typing from memory; lower-case() needs XPath 2.0, XPath 1.0 would use
translate() instead)
again, if we get a user to select the 'first name' text as an anchor,
and then the data, it is easy to generate the XPath automatically from
those two nodes.
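
To make the two ideas above concrete, here is a minimal sketch that only builds XPath strings: one helper prunes redundant positional indices, the other generates an anchored expression from the text a user selected. The helper names prune_xpath and anchored_xpath are made up for this example, and the assumption that html, head, and body each occur once is mine. The expression uses XPath 1.0's translate() for case folding.

```ruby
# Drop positional predicates such as [1] from steps naming elements that are
# assumed to occur only once per document (html, head, body).
def prune_xpath(xpath)
  unique = %w[html head body].join("|")
  xpath.gsub(%r{/(#{unique})\[\d+\]}) { "/#{$1}" }
end

# Build an XPath 1.0 expression that finds the cell whose text contains
# anchor_text (case-insensitively) and selects the text of the next cell.
def anchored_xpath(anchor_text, cell: "td")
  upper = ("A".."Z").to_a.join
  lower = ("a".."z").to_a.join
  needle = anchor_text.downcase
  # XPath 1.0 string literals have no escape syntax, so reject quotes here.
  raise ArgumentError, "anchor text may not contain a single quote" if needle.include?("'")

  "//#{cell}[contains(translate(text(), '#{upper}', '#{lower}'), '#{needle}')]" \
    "/following-sibling::#{cell}[1]/text()"
end

pruned   = prune_xpath("/html[1]/body[1]/table[2]/tr[3]")
anchored = anchored_xpath("First Name")
```

Indices on other steps (table[2], tr[3]) are kept, since removing them would change which nodes are selected.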
Cheers,
MAP
I think you are on exactly the right track with this. We need to be
able to manage dozens or hundreds of simple modules for doing the
scraping, but the logic for processing, aggregating, and storing the
data must be centralized.
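
A minimal sketch of that split, assuming a made-up "vendor,amount" text format: the adaptor only turns raw input into records, while validation, aggregation, and storage live in one central class. The names Platform and CsvContractsAdaptor, and the in-memory array standing in for a database, are hypothetical.

```ruby
# Hypothetical platform core: adaptors only turn raw input into records;
# validation, aggregation, and storage happen in one central place.
class Platform
  attr_reader :store

  def initialize
    @adaptors = {}
    @store = []   # stand-in for a real database
  end

  # An adaptor is any object responding to #records(raw) and returning hashes.
  def register(name, adaptor)
    @adaptors[name] = adaptor
  end

  def ingest(name, raw)
    @adaptors.fetch(name).records(raw).each do |record|
      next unless record[:vendor] && record[:amount]   # central validation
      @store << record.merge(source: name)
    end
  end

  # Central aggregation: total contract value per vendor across all sources.
  def totals_by_vendor
    @store.each_with_object(Hash.new(0)) { |r, sums| sums[r[:vendor]] += r[:amount] }
  end
end

# A deliberately tiny adaptor for the made-up "vendor,amount" format.
class CsvContractsAdaptor
  def records(raw)
    raw.lines.map do |line|
      vendor, amount = line.chomp.split(",")
      { vendor: vendor, amount: amount && Integer(amount) }
    end
  end
end

platform = Platform.new
platform.register(:finance, CsvContractsAdaptor.new)
platform.ingest(:finance, "Acme,100\nBolt,50\n")
totals = platform.totals_by_vendor
```

Because the adaptor has a single small method, a contributor can write one without touching the storage or aggregation code.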
Also, there are a lot of tricky bits that can be common between data
adaptors ("scrapers"). For example, you will probably want to throttle
requests so that we don't overload the target server, and we will want
to schedule periodic updates. If data has not changed, we might even
want to use an HTTP cache in order to avoid retrieving large unchanged
data-sets. Over time we might want to do things like store the
original data that was scraped, for diagnostic or auditing purposes.
Even if we provide sample code for contributors to work from, having
this functionality implemented in each adaptor will become rapidly
unmanageable, not to mention the fact that it will be a high barrier
to entry for contributors.
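
To show how throttling and caching could live in the platform rather than in each adaptor, here is a minimal sketch of a shared fetcher. Everything in it is an assumption for illustration: the class name PoliteFetcher, the per-host interval, and the injected transport callable, which stands in for real HTTP code that would send an If-None-Match header and report a 304 when the data is unchanged.

```ruby
# Hypothetical shared fetcher: throttles requests per host and reuses the
# cached body when the transport reports the data is unchanged (status 304).
class PoliteFetcher
  Response = Struct.new(:status, :body, :etag)

  # transport: callable taking (url, etag) and returning a Response. It is
  # injected so that this sketch runs without a network connection.
  def initialize(transport, min_interval: 1.0,
                 clock: -> { Time.now.to_f }, sleeper: ->(s) { sleep(s) })
    @transport = transport
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @last_request = {}   # host => time of the last request
    @cache = {}          # url  => last full Response
  end

  def fetch(url)
    throttle(host_of(url))
    cached = @cache[url]
    response = @transport.call(url, cached && cached.etag)
    return cached.body if cached && response.status == 304

    @cache[url] = response
    response.body
  end

  private

  def host_of(url)
    url[%r{\A\w+://([^/]+)}, 1] || url
  end

  # Wait until at least min_interval seconds have passed for this host.
  def throttle(host)
    last = @last_request[host]
    if last
      wait = @min_interval - (@clock.call - last)
      @sleeper.call(wait) if wait > 0
    end
    @last_request[host] = @clock.call
  end
end

# Stub transport: the first request returns data, later ones report "unchanged".
stub = lambda do |_url, etag|
  if etag == "v1"
    PoliteFetcher::Response.new(304, nil, "v1")
  else
    PoliteFetcher::Response.new(200, "contract data", "v1")
  end
end

fetcher = PoliteFetcher.new(stub, min_interval: 0.0)
first  = fetcher.fetch("http://example.org/contracts")
second = fetcher.fetch("http://example.org/contracts")   # served from the cache
```

Keeping the cached response around also gives a natural place to store the original scraped data for later diagnosis or auditing.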
So I believe that our greatest efforts should be focused, as you
say, on creating the framework on which the adaptors can be deployed,
including the documentation and testing tools necessary to allow
individuals to easily develop, test, and maintain adaptors.
I suggest that we start thinking of this as the "Visible Government
Platform".
> Re: software tools. I started out thinking perl because that's what
> govtrack used, and also it's what w3mir is written in. However, you
> guys are right, better tools are likely out there. Marc-Antoine: I
> had forgotten about libxml2... I've used it in past projects with the
> C binding. Also, it seems to be the fastest in this ruby-based
> comparison:
>
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/226482
>
> Ultimately, though, because I think this project is going to involve
> lots of people, ease of syntax is probably going to be one of the
> biggest considerations. Will do more research.
There are two questions here:
(1) What tools do we choose for implementation of the platform? and
(2) What tools do we support for the development of adaptors,
visualization components, etc?
I believe we should strive towards a heterogeneous solution that
supports the use of the best tools depending on the situation and
available resources. Sure, a specific service might be developed on one
platform, but if we want to be able to accept bids from different
implementors for different services, we should be open to the use of a
variety of technologies. When it comes to integrating existing tools,
we will again be stronger if we can integrate them regardless of
whether they are in PHP, Ruby, or Java.
If we plan to crowd-source data adaptors, we will reach a much wider
audience if we support more than one language. Some very smart and
motivated people will only know Ruby, or Java, or Perl.
Finally, each technology has its own strengths and weaknesses that
might make it more or less suitable to different parts of the overall
solution.
This is a problem faced by many organizations, and there are various
ways to resolve it, whether it be SOAP, RESTful services, or automatic
library wrapper generation using SWIG. Google, for example, is known
to support C++, Java, and Python as "official languages" in which
services may be deployed. Facebook has open-sourced their own tool,
"Thrift", which claims to enable easy communication between all of the
above languages and more.
Which technique, if any, to adopt will depend on a more thorough
analysis of the requirements of the system we are building.
-Erik