Status Report: Monday, August 31st, 2009


bouvard

Sep 1, 2009, 12:20:24 AM
to votersdaily
Hello again everyone,

I wanted to give everyone another update on how things are
progressing. I know most of you are following the repository, but it's
not always obvious what progress is being made from that perspective.
Things have slowed down just a bit for us as both Chauncey and I have
had some slight alterations in the amount of time we can dedicate to
the project on a daily basis--nevertheless, progress has been made!
And quite a bit of it at that.

We have been making data integrity and consistency our number one
priority in the hopes that we can iron out any changes that need to be
made to the CouchDB schema early on and thus lower the barrier to
entry for contributors. Since we started tracking tickets (2-3 days
into the project) we have created 71 and closed 64 of those, the
majority of them relating to data formatting and internal
consistency. The near-final database schema is available here:

http://wiki.github.com/bouvard/votersdaily/database-planning

A smaller subset of those tickets relate to new functionality,
which includes an array of command line parameters for working with
the scrapers, an automated scheduler (run.py) for running all the
scraper scripts, and a suite of validation tests (validate_couchdb.py
and validate_scripts.py).

But what I'm most excited about is that Chauncey has volunteered to
host a test database on his VPS so that others can take a look at what
we have so far. You can find the public CouchDB instance at:

http://www.congressspacebook.com:5984/_utils/

The scraper is not actively updating that instance (yet), but the
data is reasonably fresh. Perhaps more importantly, the instance is
up-to-date with the views which will eventually power the API. To see
example output, try these queries:

Executive branch activity since August 1st:
http://www.congressspacebook.com:5984/vd_events/_design/api/_view/Executive?endkey=%222009-08%22&descending=true

Supreme Court activity in the month of May:
http://www.congressspacebook.com:5984/vd_events/_design/api/_view/Supreme%20Court?startkey=%222009-07%22&endkey=%222009-06%22&descending=true
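
If you'd rather hit the raw views from a script than a browser,
something like this should work (Python 2, standard library only; the
exact fields in each row's value depend on our document schema):

    import json
    import urllib2

    # Fetch recent Executive branch events from the public test instance.
    # Standard CouchDB view output: {"total_rows": ..., "rows": [...]}
    url = ("http://www.congressspacebook.com:5984/vd_events/_design/api/"
           "_view/Executive?endkey=%222009-08%22&descending=true")
    result = json.load(urllib2.urlopen(url))
    for row in result["rows"]:
        print row["id"], row["key"]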

I have also been working on the API layer (built using Django) which
will sit on top of CouchDB to clean up these URLs and provide
a standardized interface in case we ever need to change the
underlying data structures. The repository for that project is
available here:

http://github.com/bouvard/votersdaily_web/tree/master
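
The basic idea is just a thin pass-through: each clean URL on our side
maps onto one of the CouchDB views underneath. A rough sketch of what
one of those Django views might look like (illustrative only, not the
actual votersdaily_web code; the localhost URL and function name are
placeholders):

    import urllib
    import urllib2
    from django.http import HttpResponse

    COUCH_VIEW = "http://localhost:5984/vd_events/_design/api/_view/%s"

    def branch_events(request, branch):
        # e.g. /api/events/Executive/ maps onto the "Executive" view;
        # query-string parameters are passed straight through to CouchDB.
        url = COUCH_VIEW % urllib.quote(branch)
        if request.GET:
            url += "?" + request.GET.urlencode()
        data = urllib2.urlopen(url).read()
        return HttpResponse(data, mimetype="application/json")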

We have come a long way already on this project, but there is much
to be done. In addition to resolving our outstanding tickets, our
two main focuses at this point have to be expanding our dataset and
completing a first draft of the API application. In particular there
is much investigation to be done regarding the feasibility of scraping
data sources that are only available in PDF.

Your continued attention and support as our project matures is much
appreciated.

Thanks,
Chris

John Wonderlich

Sep 2, 2009, 1:58:03 PM
to voter...@googlegroups.com
This is awesome.

I have a question.  (maybe two)

Will updates to scraped web pages be reflected in the database? (perhaps daily?)  Also, will the database link back to the original source?

I ask because schedule information becomes actionable when it's verifiable.  This is especially true of congressional committee schedules, which change with great frequency.

If a single web service provides updated information, reflecting changes made to the original, then those responsible for publishing this information (committee webmasters, in this example) have a positive incentive to see it succeed.  It gets their information out.

If the service doesn't reflect changes at the source, drive traffic to it, or allow for contacts/additional queries to be addressed to the proper channel, then the publishers will see such an effort as possibly producing more work.

Eric Mill

Sep 2, 2009, 2:35:49 PM
to voter...@googlegroups.com
Listen to this man - he knows what he is talking about. :)

-- Eric

bouvard

Sep 2, 2009, 3:05:18 PM
to votersdaily
Thanks for the questions John! Let me address them as clearly as I
can, beginning with the caveat that it's entirely possible that
there is a way to solve the "big problem" I'm about to describe--if
you know of one, please let me know. :-)

Each scraper that we have written is executed independently and
scrapes a single source URL (plus any necessary drill-down URLs).
This allows us to configure an update-check frequency for each scraper
in its config file. We haven't tweaked these values at all yet, but I
anticipate that different sources will have check intervals ranging
from 15 minutes, for the most important data, to 24 hours, for the
House holiday schedule. (Trying to avoid the bot-ban hammer, of
course.) Any events added to the page will be inserted into the
database in as close to real time as we can manage.
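
To give a rough picture of what the scheduler is doing (the script
names and intervals below are made up for illustration; the real
values live in each scraper's config file):

    import subprocess
    import time

    # Each scraper gets its own check interval (seconds); these are examples.
    SCRAPERS = [
        ("scrapers/house_floor.py",    15 * 60),       # important source: every 15 minutes
        ("scrapers/house_holidays.py", 24 * 60 * 60),  # rarely changes: once a day
    ]

    last_run = dict((script, 0) for script, _ in SCRAPERS)

    while True:
        now = time.time()
        for script, interval in SCRAPERS:
            if now - last_run[script] >= interval:
                subprocess.call(["python", script])
                last_run[script] = now
        time.sleep(60)  # wake up once a minute and run whatever is due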

Now, as for "updating" data, that is a massively tricky problem that I
haven't come up with a technical solution to. We generate unique ids
for each document by cat'ing together the event datetime, the scraper
name, and a "key" (such as a vote number, or other unique index, or an
event title). These unique ids become the _id field in CouchDB and
are used for very fast duplicate checking each time the scraper runs.
Now CouchDB has built-in support for versioning, so in theory
supporting changes to documents would be fairly trivial on the back-
end. The problem is that I can deduce no reliable way of identifying
when a "row" of data has changed. As a result, if the datetime or key
value has changed in the source, then it would get an entirely new row
in our database. If other values have changed then it would get
dropped because the id would be duplicated.
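
To make that concrete, the id scheme looks roughly like this (the
separator and the HEAD-based existence check are illustrative; the
real scrapers may differ in the details):

    import urllib
    import urllib2

    def make_id(event_datetime, scraper_name, key):
        # e.g. "2009-08-31T10:00:00|senate_floor|vote-273"
        return "%s|%s|%s" % (event_datetime, scraper_name, key)

    def already_stored(doc_id, db_url="http://localhost:5984/vd_events"):
        # HEAD against the document URL: 200 means the _id already exists,
        # 404 means it is safe to insert a new document.
        request = urllib2.Request("%s/%s" % (db_url, urllib.quote(doc_id, safe="")))
        request.get_method = lambda: "HEAD"
        try:
            urllib2.urlopen(request)
            return True
        except urllib2.HTTPError, e:
            if e.code == 404:
                return False
            raise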

Now, to allay what seems to be your primary fear: we log a source URL
for every single event. So anyone querying back from this database
can drive traffic to the original source. (I'm a huge advocate of
proper attribution for data.) We also log the exact time that we
accessed that source URL.

Back to the problem of "actual" duplicates (same event but the date
changed, for example): I foresee this being something we might be able
to handle on the frontend, by allowing users to flag erroneous data for
review. Of course there are technical things we can try with regard
to "changes" in the source--title matching, date matching, etc.
However, these are bound to be very error prone. What if an event is
deleted completely? The scraper would not have an intuitive way of
knowing if that is a cancellation, something passing beyond a 30 day
visibility window, or simply an error. What if six events on
different days genuinely have otherwise identical information? There
are a large number of edge cases that I'm not sure how to handle, so
my intuition is to simply gather all data (duplications if necessary,
though I expect them to be infrequent), and then have some sort of
post facto data integrity process. I would rather have too much than
too little.

(Aside: One significant benefit of having an API layer apart from
CouchDB is that I can easily build hooks back to us, such that anyone
who leverages the data can report problems for us to review.)

I'd love to hear more of your thoughts on these particularly tricky
problems.

Thanks again for the feedback,
Chris


bouvard

Sep 2, 2009, 3:13:34 PM
to votersdaily
I also wanted to follow up and say that Chauncey has been doing his
superman-PHP-hacker thing and has now scraped all the C-SPAN-produced
House committee schedules. He is working on the Senate and we expect
this will bring our total dataset up to ~6000 events. As I understand
it he still has some code review to do, but the data looks solid and
it has already been merged into master. :-)

Chris

John Wonderlich

Sep 2, 2009, 3:26:33 PM
to voter...@googlegroups.com
Great answer.  I don't have any idea how to address the row-updating problem you describe, but I'd say this approach sounds exactly right:


> What if six events on
> different days genuinely have otherwise identical information?  There
> are a large number of edge cases that I'm not sure how to handle, so
> my intuition is to simply gather all data (duplications if necessary,
> though I expect them to be infrequent), and then have some sort of
> post facto data integrity process.  I would rather have too much than
> too little.

...and that the edge cases you describe do indeed sound like a real barrier.

Also, you're right that including a source URL solves most of the basic problem.  Anyone attending an event based on scraped and syndicated data should call ahead (or double check) before attending in person.  (especially w/r/t committee hearings)

A contact number could be a useful field to include, if available.

Christopher Groskopf

Sep 2, 2009, 3:43:11 PM
to voter...@googlegroups.com
Well it's somewhat satisfying that I haven't overlooked an obvious
solution. ;-) As a general rule we are scraping all available data, so
if there is a contact number available for a given event, it will
certainly be part of the dataset.

Chris