Interesting: Expense Disclosure Pages are All Slightly Different

Jennifer Bell

Jul 28, 2008, 3:44:57 PM

to VisibleGovernment_ExpenseDB

Call me silly, but my initial browsing of the hospitality and
expense pages left me with the idea that there was at least a standard
table format for expense disclosure.

It turns out there isn't, really. Each department has done its own
thing, in a slightly different way. Check out these links from
different departments:

This is the vanilla format I was expecting to see as standard:
http://www4.agr.gc.ca/AAFC-AAC/display-afficher.do?id=1214843370679&lang=e

Here, data is divided into different tables by quarter, with dates
relative to that quarter:
http://www.ps-sp.gc.ca/abt/trv_hsp/year-detail-eng.aspx?uid=46&SelectedYear=2007&q=1&p=2007-03-30

Here, data is divided into 'travel' and 'hospitality' pages, and
there's an extra 'Recovered Costs' column:
http://www.oag-bvg.gc.ca/internet/English/oag-bvg_e_30885.html

Interesting. Potential scraping approaches:

- Write scripts that find things in this web tree that look like
tables, and do a best guess on column fit and syntax. Fix breaks
individually.
- Crowd-source: recruit lots of people and assign each one
responsibility for mapping a department by hand into a standard format.

Or a mix of the two. Both will require a scraping framework to start
with.
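
A minimal sketch of the "best guess on column fit" idea, assuming the header cells of a table have already been extracted as strings. The field names and patterns in COLUMN_PATTERNS are hypothetical, chosen only to cover the three sample pages above; they are not an agreed schema.

```ruby
# Hypothetical standard fields; the thread does not define the real schema.
# Order matters: 'Recovered Costs' must be tested before the generic cost pattern.
COLUMN_PATTERNS = {
  recovered: /recovered/i,
  date:      /date/i,
  purpose:   /purpose|description/i,
  amount:    /total|amount|cost/i
}.freeze

# Best-guess mapping from each raw header to a standard field (nil if unknown).
def guess_columns(headers)
  headers.map do |header|
    match = COLUMN_PATTERNS.find { |_field, pattern| header =~ pattern }
    match && match.first
  end
end

# Turn one table row into a hash keyed by standard field, skipping unknown columns.
def normalize_row(headers, cells)
  row = {}
  guess_columns(headers).zip(cells).each do |field, cell|
    row[field] = cell.to_s.strip if field
  end
  row
end

headers = ['Date(s)', 'Purpose', 'Total Cost', 'Recovered Costs']
cells   = ['2008-01-15', ' Conference ', '$1,200.00', '$200.00']
p normalize_row(headers, cells)
```

Headers that match nothing come back as nil, which is the natural place to hand a department over to a human for the "fix breaks individually" step.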

Any thoughts? I'm looking into writing the scrapers immediately, or
at least in the short term, as I'm a little worried about getting
scooped on this project. The 'get' section on theinfo.org has some
good resources for scraping. (Good thing I spent some time in the
mid-90s writing perl scripts.)

In the short term, the data can always be uploaded to ManyEyes for
visualization.

Jennifer

Marc-Antoine Parent

Jul 28, 2008, 3:59:34 PM

to visiblegovern...@googlegroups.com

Hello, Jennifer!
For scraping: I highly recommend using libxml2 to parse the HTML, and
XPaths to pinpoint the info you need. It beats regexps in perl any
day, especially in terms of resilience to change if you are careful.
You can actually supplement the XPath syntax with (strategically used)
regexps using some XPath extensions that are standard in libxml2
(which is why I suggest that specific library.)
libxml2 has bindings in most scripting languages. I use lxml for Python.
Cheers,
Marc-Antoine

Erik Wright

Jul 28, 2008, 4:06:05 PM

to visiblegovern...@googlegroups.com

In the long term, we will need to put in place an architecture that
separates the scraping of data from how it is stored or processed.

We also will need to make sure that this architecture is agile and
quick to adapt. After all, if all it takes is the re-ordering of a
column or changing from Word to PDF in order to confuse the system, we
might find that some of our data sources will deliberately try to
sabotage our efforts.

Last night I read a few of the documents you have mentioned ("Hack,
Mash, and Peer" and "Government Data and the Invisible Hand"). At one
point, there is mention that in some data stores, many bills were
(Word or PDF) documents and some others were image files. In order to
scrape such a structure, our architecture would need to be relatively
flexible.

So while any parser that we write should certainly be as open-minded
as possible (while still providing reliable results) we should resign
ourselves to the fact that sometimes different parsers will be needed
for different parts of a data set.
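
One way to picture that separation, as a sketch only: a registry that picks a parser by source format and hands plain records to a storage layer that knows nothing about parsing. The class name ParserRegistry, the :csv and :pipe formats (stand-ins for HTML, Word, or PDF sources), and the in-memory store are all hypothetical.

```ruby
# Maps a source format to the parser that understands it.
class ParserRegistry
  def initialize
    @parsers = {}
  end

  # Register a block that turns raw text into an array of record hashes.
  def register(format, &parser)
    @parsers[format] = parser
  end

  def parse(format, raw)
    parser = @parsers.fetch(format) { raise ArgumentError, "no parser for #{format}" }
    parser.call(raw)
  end
end

# Shared helper: split delimited lines into name/amount records.
def split_records(raw, separator)
  raw.each_line.map do |line|
    name, amount = line.chomp.split(separator)
    { name: name, amount: amount.to_f }
  end
end

registry = ParserRegistry.new
registry.register(:csv)  { |raw| split_records(raw, ',') }
registry.register(:pipe) { |raw| split_records(raw, '|') }

store = [] # stand-in for the database layer
store.concat(registry.parse(:csv, "Smith,120.50\nJones,80\n"))
store.concat(registry.parse(:pipe, "Lee|42.25\n"))
p store
```

Adding a new source format then means registering one more parser, without touching how records are stored or processed.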

-Erik

Don Kelly

Jul 28, 2008, 4:07:40 PM

to visiblegovern...@googlegroups.com

On Mon, Jul 28, 2008 at 3:44 PM, Jennifer Bell
<visibleg...@gmail.com> wrote:
> Any thoughts? I'm looking into writing the scrapers immediately, or
> at least in the short term, as I'm a little worried about getting
> scooped on this project. The 'get' section on theinfo.org has some
> good resources for scraping. (Good thing I spent some time in the
> mid-90s writing perl scripts.)
>

Scrapers and grammar inference are loads of fun. I got started on one
for extracting the information from http://www.parl.gc.ca/ for use in
mass mailouts. Never really finished that project...

I used Ruby+Hpricot to do the scraping. Hpricot is a great little
library well suited to the task. Nothing reads better than:

require 'hpricot'
require 'open-uri'  # lets open() fetch a URL
doc = Hpricot.parse(open("http://www.foo.com"))
(doc/"table").each { |elem| process_element(elem) }  # process_element is your own handler

Don/

--
karfai [AT] gmail.com
http://www.strangeware.ca
http://blog.strangeware.ca

David Akin

Jul 28, 2008, 10:08:06 PM

to visiblegovern...@googlegroups.com

On Mon, Jul 28, 2008 at 3:44 PM, Jennifer Bell
<visibleg...@gmail.com> wrote:
>

> Call me silly, but my initial browsing of the hospitality and
> expense pages left me with the idea that there was at least a standard
> table format for expense disclosure.

The inconsistency in table definitions and data definitions extends to
contract disclosure as well.
[Here's an example of the data available from the 4th quarter for Finance]

Finance, as you might expect, seems to apply their disclosure policies
for contracts with a view towards appropriate government accounting
policies. But last year, when I tried to track and cross-tab contracts
by vendor across all govt depts, smaller agencies and boards were
horrendously inconsistent in applying their disclosure.

--
David Akin
-------------------
http://www.davidakin.com

Jennifer Bell

Jul 29, 2008, 1:28:13 PM

to VisibleGovernment_ExpenseDB

Thanks to everyone for their responses.

You're right Erik, the design will have to be modular and fault
tolerant. Also, it has to separate gathering from displaying /
using. The software types here will likely have noticed that the
discussion has ranged through several concepts that are almost
completely different tools: scraping / visualizing / tagging +
rating. I've uploaded a draft use case document to the 'file' section
of this group that splits these up into more manageable chunks for
discussion. Also, there's a first pass at a common database schema
definition for scraping.

With regards to fault tolerance in the scraping system: Even if we
build a brilliant framework I have no doubt that next quarter, when
new data is published, there will be chaos. I was thinking that a
nice way to build crowd-sourcing into the system would be to set up
the general scraping framework, then post to craigslist or developer
newsgroups to get, say, 80-100 volunteers to take personal
responsibility for adapting modules to a particular department and
monitoring breaks/changes. That way, when/if the data changes,
there's hopefully a faster response time. The added visibility +
public involvement helps advance the cause too.

Re: software tools. I started out thinking perl because that's what
govtrack used, and also it's what w3cmir is written in. However, you
guys are right, better tools are likely out there. Marc-Antoine: I
had forgotten about libxml2... I've used it in past projects with the
C binding. Also, it seems to be the fastest in this ruby-based
comparison:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/226482

Ultimately, though, because I think this project is going to involve
lots of people, ease of syntax is probably going to be one of the
biggest considerations. Will do more research.

Jennifer

Marc-Antoine Parent

Jul 29, 2008, 1:35:31 PM

to visiblegovern...@googlegroups.com

> Re: software tools. I started out thinking perl because that's what
> govtrack used, and also it's what w3cmir is written in. However, you
> guys are right, better tools are likely out there. Marc-Antoine: I
> had forgotten about libxml2... I've used it in past projects with the
> C binding. Also, it seems to be the fastest in this ruby-based
> comparison:
>
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/226482
>
> Ultimately, though, because I think this project is going to involve
> lots of people, ease of syntax is probably going to be one of the
> biggest considerations. Will do more research.

Agreed on ease of syntax. However, don't forget the potential of
helper apps.
There are tools that, when you click on a node in a document,
display the XPath that leads to that node.
The XPath is overdetermined; you find useless things such as
/html[1]/body[1]/...
which can easily (and automatically) be pruned to
/html/body/...
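
A small sketch of that automatic pruning, under one stated assumption: a position predicate is only dropped on steps that occur once per document (html, head, body), since dropping it elsewhere would change what the path selects. XPath positions start at 1, so the redundant predicate is [1]. The helper name prune_xpath is made up for illustration.

```ruby
# Steps assumed to occur at most once, so a [1] predicate on them is redundant.
UNIQUE_STEPS = %w[html head body].freeze

# Remove redundant [1] predicates, leaving meaningful ones like table[2] alone.
def prune_xpath(xpath)
  xpath.gsub(%r{/(#{UNIQUE_STEPS.join('|')})\[1\]}, '/\1')
end

puts prune_xpath('/html[1]/body[1]/table[2]/tr[1]/td[3]')
```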
More interesting is the possibility of pointing to an anchor:
//td[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'first name')]/following-sibling::td/text()
(typing from memory, I am surely making mistakes in my function names)
again, if we get a user to select the 'first name' text as an anchor,
and then the data, it is easy to generate the XPath automatically from
those two nodes.
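
A runnable sketch of the anchor idea, with some assumptions stated up front. XPath 1.0 has no lower-casing function, so translate() is used for case-insensitive matching, and the axis for "the cells after this one" is following-sibling. REXML is used only because it ships with Ruby and keeps the example self-contained; it needs well-formed XML, so real HTML pages would need an HTML parser instead. The helper names anchor_xpath and value_after are hypothetical.

```ruby
require 'rexml/document'

LOWER = 'abcdefghijklmnopqrstuvwxyz'
UPPER = LOWER.upcase

# Build an XPath for the cell that follows the cell containing the label.
# Assumes the label contains no single quote.
def anchor_xpath(label)
  "//td[contains(translate(text(), '#{UPPER}', '#{LOWER}'), '#{label.downcase}')]" \
    '/following-sibling::td'
end

# Return the trimmed text of the cell after the anchor, or nil if not found.
def value_after(doc, label)
  cell = REXML::XPath.first(doc, anchor_xpath(label))
  cell && cell.text.to_s.strip
end

page = REXML::Document.new(<<~XML)
  <table>
    <tr><td>First Name:</td><td> Jane </td></tr>
    <tr><td>Purpose</td><td>Conference</td></tr>
  </table>
XML

puts value_after(page, 'first name')
```

Because the expression keys on the label text rather than on row or column positions, it keeps working if rows are reordered, which is the resilience argument made earlier in the thread.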
Cheers,
MAP

Erik Wright

Jul 30, 2008, 10:41:28 AM

to visiblegovern...@googlegroups.com

> With regards to fault tolerance in the scraping system: Even if we
> build a brilliant framework I have no doubt that next quarter, when
> new data is published, there will be chaos. I was thinking that a
> nice way to build crowd-sourcing into the system would be to set up
> the general scraping framework, then post to craigslist or developer
> newsgroups to get, say, 80-100 volunteers to take personal
> responsibility for adapting modules to a particular department and
> monitoring breaks/changes. That way, when/if the data changes,
> there's hopefully a faster response time. The added visibility +
> public involvement helps advance the cause too.
>

I think you are on exactly the right track with this. We need to be
able to manage dozens or hundreds of simple modules for doing the
scraping, but the logic for processing, aggregating, and storing the
data must be centralized.

Also, there are a lot of tricky bits that can be common between data
adaptors ("scrapers"). For example, you will probably want to throttle
requests so that we don't overload the target server, and we will want
to schedule periodic updates. If data has not changed, we might even
want to use an HTTP cache in order to avoid retrieving large unchanged
data-sets. Over time we might want to do things like store the
original data that was scraped, for diagnostic or auditing purposes.
Even if we provide sample code for contributors to work from, having
this functionality implemented in each adaptor will become rapidly
unmanageable, not to mention the fact that it will be a high barrier
to entry for contributors.
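
To make the shared pieces concrete, here is a rough sketch of a fetcher that throttles requests and reuses cached bodies when the server reports no change. Everything named here is an assumption for illustration: the class PoliteFetcher, and the transport interface (a callable taking a URL and the last known ETag and returning status, ETag, and body). A real transport would send an HTTP request with an If-None-Match header.

```ruby
# Wraps a transport with request throttling and a conditional-request cache.
class PoliteFetcher
  def initialize(transport, min_interval: 2.0,
                 clock: -> { Time.now.to_f }, sleeper: ->(s) { sleep(s) })
    @transport = transport
    @min_interval = min_interval
    @clock = clock       # injectable so the behaviour can be tested without waiting
    @sleeper = sleeper
    @last_request = nil
    @cache = {}          # url => { etag:, body: }
  end

  def fetch(url)
    throttle
    cached = @cache[url]
    status, etag, body = @transport.call(url, cached && cached[:etag])
    return cached[:body] if status == 304 && cached # unchanged: reuse stored copy
    @cache[url] = { etag: etag, body: body }
    body
  end

  private

  # Wait until at least min_interval seconds have passed since the last request.
  def throttle
    if @last_request
      wait = @min_interval - (@clock.call - @last_request)
      @sleeper.call(wait) if wait > 0
    end
    @last_request = @clock.call
  end
end

# Fake transport standing in for the network: answers 304 when the ETag matches.
fake = lambda do |_url, etag|
  etag == 'v1' ? [304, 'v1', nil] : [200, 'v1', '<table>Q1</table>']
end

fetcher = PoliteFetcher.new(fake, min_interval: 0)
puts fetcher.fetch('http://example.org/expenses')
puts fetcher.fetch('http://example.org/expenses') # served from the cache
```

With this in the framework, an individual adaptor only asks for a page body and never deals with timing or caching itself.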

So I believe that our greatest efforts should be focused, as you
say, on creating the framework on which the adaptors can be deployed,
including the documentation and testing tools necessary to allow
individuals to easily develop, test, and maintain adaptors.

I suggest that we start thinking of this as the "Visible Government
Platform".

> Re: software tools. I started out thinking perl because that's what
> govtrack used, and also it's what w3cmir is written in. However, you
> guys are right, better tools are likely out there. Marc-Antoine: I
> had forgotten about libxml2... I've used it in past projects with the
> C binding. Also, it seems to be the fastest in this ruby-based
> comparison:
>
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/226482
>
> Ultimately, though, because I think this project is going to involve
> lots of people, ease of syntax is probably going to be one of the
> biggest considerations. Will do more research.

There are two questions here:

(1) What tools do we choose for implementation of the platform? and

(2) What tools do we support for the development of adaptors,
visualization components, etc?

I believe we should strive towards a heterogeneous solution that
supports the use of the best tools depending on the situation and
available resources. Sure, a specific service might be developed in one
platform, but if we want to be able to accept bids from different
implementors for different services, we should be open to the use of a
variety of technologies. When it comes to integrating existing tools,
we will again be stronger if we can integrate them regardless of
whether they are in PHP, Ruby, or Java.

If we plan to crowd-source data adaptors, we will reach a much wider
audience if we support more than one language. Some very smart and
motivated people will only know Ruby, or Java, or Perl.

Finally, each technology has its own strengths and weaknesses that
might make it more or less suitable to different parts of the overall
solution.

This is a problem faced by many organizations, and there are various
ways to resolve it, whether it be SOAP, RESTful services, or automatic
library wrapper generation using SWIG. Google, for example, is known
to support C++, Java, and Python as "official languages" in which
services may be deployed. Facebook has open-sourced their own tool,
"Thrift", which claims to enable easy communication between all of the
above languages and more.

Which technique, if any, to adopt will depend on a more thorough
analysis of the requirements of the system we are building.
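
Purely as an illustration of what a language-neutral boundary could look like, and not a proposal from this thread: adaptors written in any language could print one JSON object per line, and the platform would only validate and ingest that output. The required field names below are hypothetical.

```ruby
require 'json'

# Hypothetical minimum fields every adaptor record must carry.
REQUIRED_FIELDS = %w[department date amount].freeze

# Parse newline-delimited JSON emitted by an adaptor and check each record.
def read_records(output)
  output.each_line.map(&:strip).reject(&:empty?).map do |line|
    record = JSON.parse(line)
    missing = REQUIRED_FIELDS - record.keys
    raise ArgumentError, "missing fields: #{missing.join(', ')}" unless missing.empty?
    record
  end
end

adaptor_output = <<~OUT
  {"department": "AAFC", "date": "2008-01-15", "amount": 120.5}
  {"department": "OAG", "date": "2008-02-01", "amount": 80}
OUT

p read_records(adaptor_output)
```

SOAP, REST, or Thrift would replace the line-based transport here, but the platform-side check stays the same shape.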

-Erik

Jennifer Bell

Aug 6, 2008, 2:17:41 PM

to VisibleGovernment_ExpenseDB

Actually, setting up the framework might not be that hard. I've been
looking at scrubyt for the last couple of days, which combines the
Ruby libraries Hpricot and Mechanize. It has the big strength of being able
to infer extraction patterns from example text, and generate xpath-
based scrapers that can be used on other pages that are formatted the
same way. This makes things strangely easy, for the most part.
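
The learn-from-an-example idea can be illustrated without scrubyt itself. The sketch below is not scrubyt's API; it uses REXML from Ruby's standard distribution on well-formed sample markup, and the helper names learn_xpath and extract are made up. It finds the element holding an example value on one page, records that element's path, and reuses the path on a second page with the same layout.

```ruby
require 'rexml/document'

# Find the first element whose own text equals the example; return its XPath.
def learn_xpath(doc, example)
  node = REXML::XPath.match(doc, '//*').find { |e| e.text.to_s.strip == example }
  node && node.xpath
end

# Apply a learned path to another page with the same layout.
def extract(doc, xpath)
  node = REXML::XPath.first(doc, xpath)
  node && node.text.to_s.strip
end

q1 = REXML::Document.new(
  '<table><tr><th>Name</th><th>Total</th></tr><tr><td>Smith</td><td>120.50</td></tr></table>'
)
q2 = REXML::Document.new(
  '<table><tr><th>Name</th><th>Total</th></tr><tr><td>Jones</td><td>80.00</td></tr></table>'
)

pattern = learn_xpath(q1, '120.50') # learned from one example value
puts pattern
puts extract(q2, pattern)           # same position on the next quarter's page
```

A purely positional path like this breaks as soon as a row or column moves, which is why tools in this space generalize the learned pattern rather than replay it verbatim.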

My current thinking is to separate the tool into two parts:

1. A step-by-step CLI to help volunteers assemble, test, and submit
scrubyt scrapers for a single department. This CLI minimizes the hits
on the govt. website during testing.

2. A central server that invokes the submitted scrapers (after they've
been reviewed) and manages the database. This central server manages
testing for changes, throttling requests, etc.
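
For the "testing for changes" part of the server, one simple possibility is to fingerprint each department's column headers and flag a scraper for review when the fingerprint moves. This is a sketch under that assumption; LayoutMonitor and layout_fingerprint are hypothetical names, and the header lists are invented sample data.

```ruby
require 'digest/sha2'

# Fingerprint a table layout from its headers, ignoring case and extra spaces.
def layout_fingerprint(headers)
  normalized = headers.map { |h| h.strip.downcase.gsub(/\s+/, ' ') }
  Digest::SHA256.hexdigest(normalized.join('|'))
end

# Remembers the last fingerprint seen for each department.
class LayoutMonitor
  def initialize
    @known = {} # department => fingerprint
  end

  # Returns :new, :unchanged, or :changed for the given department.
  def check(department, headers)
    fingerprint = layout_fingerprint(headers)
    previous = @known[department]
    @known[department] = fingerprint
    return :new if previous.nil?
    previous == fingerprint ? :unchanged : :changed
  end
end

monitor = LayoutMonitor.new
p monitor.check('OAG', ['Date', 'Purpose', 'Total'])
p monitor.check('OAG', ['date', ' Purpose', 'Total'])
p monitor.check('OAG', ['Date', 'Purpose', 'Total', 'Recovered Costs'])
```

A :changed result would be the cue to notify whichever volunteer looks after that department.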

Also, there would have to be a wiki or some such tool for coordinating
volunteers.

I've been working on the volunteer CLI, and have something rough. My
first run setting up a new department with the CLI took about 20 min
(but then, I knew what to do!). My current thinking is to do 10
departments, improving the CLI as I go along, to make sure I've gotten
a good sample.

Scrubyt is not all sunshine and roses, however. I'm using an older
version because the newest one, from about a year ago, has conflicts
with a basic Rails installation. Documentation is not-so-great, and
there are perplexing bugs in this older version... but the ability to
generate scraping templates from example text is pretty amazing.

The idea of incorporating submissions in other languages is
interesting, and I like the idea of appealing to a wide variety of
programmers. But ultimately, if there's a strong existing tool which
makes the process quick + easy, it may be best to use it.

Jennifer
