Thanks for the questions, John! Let me address them as clearly as I
can, beginning with the caveat that it's entirely possible there is a
way to solve the "big problem" I'm about to describe--if you know of
one, please let me know. :-)
Each scraper that we have written is executed independently and
scrapes a single source URL (plus any necessary drill-down URLs).
This allows us to configure an update-check frequency for each scraper
in its config file. We haven't tweaked these values at all yet, but I
anticipate that different sources will have check intervals ranging
from 15 minutes, for the most important data, to 24 hours, for the
House holiday schedule. (Trying to avoid the bot-ban hammer, of
course.) Any events added to a source page will be inserted into the
database in as close to real time as we can manage.
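To make that concrete, here's a rough sketch of the scheduling idea
(the scraper names and intervals below are illustrative, not our
actual config files):

    import time

    # Hypothetical sketch -- names and values are illustrative only.
    SCRAPER_CONFIG = {
        "committee_hearings":     {"check_interval_minutes": 15},    # highest-priority data
        "house_holiday_schedule": {"check_interval_minutes": 1440},  # once a day is plenty
    }

    def run_scraper_forever(name, scrape_fn):
        """Run one scraper on its own schedule, independent of the others."""
        interval = SCRAPER_CONFIG[name]["check_interval_minutes"] * 60
        while True:
            scrape_fn()           # fetch the source URL plus any drill-down URLs
            time.sleep(interval)  # space out requests; avoid the bot-ban hammer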
Now, as for "updating" data, that is a massively tricky problem that I
haven't come up with a technical solution to. We generate unique ids
for each document by cat'ing together the event datetime, the scraper
name, and a "key" (such as a vote number, or other unique index, or an
event title). These unique ids become the _id field in CouchDB and
are used for very fast duplicate checking each time the scraper runs.
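In code, the id construction and duplicate check look roughly like
this (a sketch against CouchDB's HTTP API; the field names and the
database URL are made up):

    import requests
    from urllib.parse import quote

    COUCH_URL = "http://localhost:5984/vd_events"  # assumption: our CouchDB database

    def make_id(event_datetime, scraper_name, key):
        # Concatenate datetime + scraper name + key (vote number, title, etc.)
        return "%s|%s|%s" % (event_datetime, scraper_name, key)

    def insert_if_new(doc):
        doc_id = make_id(doc["datetime"], doc["scraper"], doc["key"])
        # PUT to /db/<_id>: CouchDB answers 201 Created for a new document,
        # 409 Conflict if that _id already exists -- our fast duplicate check.
        resp = requests.put("%s/%s" % (COUCH_URL, quote(doc_id, safe="")), json=doc)
        return resp.status_code == 201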
Now, CouchDB has built-in support for versioning, so in theory
supporting changes to documents would be fairly trivial on the back
end. The problem is that I can see no reliable way of identifying
when a "row" of data has changed at the source. As a result, if the
datetime or key value changes, the event gets an entirely new row in
our database; if any other value changes, the update gets dropped
because the id collides with the existing document.
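For what it's worth, the back-end half really would be easy: if we
could tell that an existing document needed to change, the update is
just a fetch and a PUT with the current _rev. A sketch, under the
same assumptions as above:

    import requests
    from urllib.parse import quote

    COUCH_URL = "http://localhost:5984/vd_events"  # same assumption as above

    def update_existing(doc_id, new_fields):
        url = "%s/%s" % (COUCH_URL, quote(doc_id, safe=""))
        current = requests.get(url).json()   # includes the current _rev
        current.update(new_fields)
        # Supplying the matching _rev is all CouchDB needs to accept the
        # overwrite and keep a revision; the hard part -- knowing WHICH
        # document changed at the source -- is the problem described above.
        return requests.put(url, json=current).status_code == 201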
Now, to allay what seems to be your primary fear: we log a source URL
for every single event. So anyone querying back from this database
can drive traffic to the original source. (I'm a huge advocate of
proper attribution for data.) We also log the exact time that we
accessed that source URL.
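So a stored event ends up looking roughly like this (field names and
values are illustrative):

    # Illustrative only -- the real field names may differ.
    example_event = {
        "_id":         "2009-09-15T10:00|committee_hearings|HR-1234 markup",
        "title":       "HR-1234 markup",
        "datetime":    "2009-09-15T10:00",
        "scraper":     "committee_hearings",
        "source_url":  "http://example.house.gov/schedule",  # link back to the source
        "accessed_at": "2009-09-02T14:32:00Z",               # exact time we fetched it
    }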
Back to the problem of "actual" duplicates (same event but the date
changed, for example): I foresee this being something we might be able
to handle on the frontend, by allowing users to flag erroneous data for
review. Of course there are technical things we can try with regard
to "changes" in the source--title matching, date matching, etc.
However, these are bound to be very error prone. What if an event is
deleted completely? The scraper would not have an intuitive way of
knowing if that is a cancellation, something passing beyond a 30 day
visibility window, or simply an error. What if six events on
different days genuinely have otherwise identical information? There
are a large number of edge cases that I'm not sure how to handle, so
my intuition is to simply gather all the data (duplicates if necessary,
though I expect them to be infrequent), and then have some sort of
post facto data integrity process. I would rather have too much than
too little.
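To show what I mean by title/date matching--and why I don't trust
it--a heuristic like the following is about the best I can picture
(the threshold and window are pure guesses):

    import difflib
    from datetime import datetime, timedelta

    def probably_same_event(old, new, threshold=0.85, window=timedelta(days=7)):
        """Crude guess at 'same event, details changed'.  Illustrative only:
        it can't tell a reschedule from a cancellation, an expired visibility
        window, or two genuinely distinct events with near-identical titles."""
        if old["scraper"] != new["scraper"]:
            return False
        title_similarity = difflib.SequenceMatcher(
            None, old["title"].lower(), new["title"].lower()).ratio()
        dates_close = abs(datetime.fromisoformat(old["datetime"]) -
                          datetime.fromisoformat(new["datetime"])) <= window
        return title_similarity >= threshold and dates_close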
(Aside: One significant benefit of having an API layer apart from
CouchDB is that I can easily build hooks back to us, such that anyone
who leverages the data can report problems for us to review.)
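A sketch of the kind of hook I mean, assuming a small Flask-style API
layer (the endpoint name is made up):

    from flask import Flask, request, jsonify

    app = Flask(__name__)
    REVIEW_QUEUE = []  # placeholder; in practice this would live somewhere durable

    @app.route("/report_problem", methods=["POST"])
    def report_problem():
        # Anyone consuming the data can POST a suspect document's _id plus a
        # note; we queue it for human review rather than auto-correcting.
        report = request.get_json()
        REVIEW_QUEUE.append((report["doc_id"], report.get("note", "")))
        return jsonify({"status": "queued for review"})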
I'd love to hear more of your thoughts on these particularly tricky
problems.
Thanks again for the feedback,
Chris
On Sep 2, 10:58 am, John Wonderlich <johnwonderl...@gmail.com> wrote:
> This is awesome.
>
> I have a question. (maybe two)
>
> Will updates to scraped web pages be reflected in the database? (perhaps
> daily?) Also, will the database link back to the original source?
>
> I ask because schedule information becomes actionable when it's verifiable.
> This is especially true of congressional committee schedules, which change
> with great frequency.
>
> If a single web service provides updated information, reflecting changes
> made to the original, then those responsible for publishing this information
> (committee webmasters, in this example) have a positive incentive to see it
> succeed. It gets their information out.
>
> If the service doesn't reflect changes at the source, drive traffic to it,
> or allow for contacts/additional queries to be addressed to the proper
> channel, then the publishers will see such an effort as possibly producing
> more work.
>
> >http://www.congressspacebook.com:5984/vd_events/_design/api/_view/Exe...
>
> > Supreme Court activity in the month of may:
>
> >http://www.congressspacebook.com:5984/vd_events/_design/api/_view/Sup...