Fwd: Scraping infrastructure


Ian Bicking

Jul 23, 2009, 2:20:13 PM
to fifty-sta...@googlegroups.com
So, I was looking at some of the fiftystates code, and most of it didn't run.  There's a bunch of issues -- dependencies, PYTHONPATH, some error messages that aren't very clear, timeouts, etc.

Anyway, I think a bit of scraping infrastructure would be helpful.  I threw this together.  I'm at OSCON, so maybe during the hackathon we can look at it (and/or the idea generally), but I thought I'd just put it out here before I get distracted this evening. (OK... apparently I got distracted and this didn't successfully go out, retrying...)


The idea of the code is that there's a scraper object, which currently just has one real method "request", which you use like:

with Scraper().request('some_url') as page:
    do stuff with page

If there's an error it gives context of what is being scraped.  It also adds some logging.  Lots of other helpers would also be useful (e.g., BS parsing).
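
A minimal sketch of how such a request context manager might look. This is not the attached errorcontext.py: the ScrapeError class, the fetch_url helper, and the injectable fetch argument are assumptions made here so the example runs without network access.

```python
import contextlib
import logging
import urllib.request

logger = logging.getLogger("scraper")


class ScrapeError(Exception):
    """Hypothetical error type; its message carries the URL and page text."""


def fetch_url(url):
    # Default fetcher (an assumption): any callable url -> text will do.
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


class Scraper:
    def __init__(self, fetch=fetch_url):
        self.fetch = fetch

    @contextlib.contextmanager
    def request(self, url):
        logger.info("Fetching %s", url)
        page = self.fetch(url)
        try:
            yield page
        except Exception as exc:
            # Re-raise with the URL and the start of the page attached,
            # so the failure shows what was being scraped.
            raise ScrapeError(
                "Error while scraping %s: %r\nPage starts with:\n%s"
                % (url, exc, page[:500])
            ) from exc
```

Any exception raised inside the with block then surfaces as a ScrapeError naming the URL, with the original exception chained to it.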

Generally given the fragile nature of this scraping, I think it will be very helpful to have really good error messages when things inevitably fail.  I called it errorcontext for this reason, but then it started expanding and I didn't think of a new name ;)


--
Ian Bicking  |  http://blog.ianbicking.org  |  http://topplabs.org/civichacker
errorcontext.py

Adam Nelson

Jul 23, 2009, 2:28:51 PM
to fifty-sta...@googlegroups.com
Ian,

I think it would be best to use a standard DOM tool like lxml:

http://codespeak.net/lxml/elementsoup.html

Their ElementSoup (which is really just a front end for BeautifulSoup)
can handle total garbage. I believe BeautifulSoup is already a
requirement - which should be fine for any scraping program as it's
the most effective way to deal with HTML pages.

-Adam

P.S. - This is just a comment - I haven't looked at the codebase in some time.
--
Adam Nelson

http://unhub.com/varud

Ian Bicking

Jul 23, 2009, 5:00:11 PM
to fifty-sta...@googlegroups.com
On Thu, Jul 23, 2009 at 11:28 AM, Adam Nelson <ad...@varud.com> wrote:

> Ian,
>
> I think it would be best to use a standard DOM tool like lxml:
>
> http://codespeak.net/lxml/elementsoup.html
>
> Their ElementSoup (which is really just a front end for BeautifulSoup)
> can handle total garbage.  I believe BeautifulSoup is already a
> requirement - which should be fine for any scraping program as it's
> the most effective way to deal with HTML pages.

Sure, lxml's native parsing is also quite competent, sometimes better than BS.  But that's another topic.  I'm more concerned about something like:

page = blah blah blah
link = page.find('a')[0].href
blah blah blah

Regardless of what you parse with, if something changes then that code will fail.  So errorcontext makes it clearer what happened when something like that fails, by showing the retrieved page whenever there is an exception.

Adam Nelson

Jul 23, 2009, 6:05:42 PM
to fifty-sta...@googlegroups.com
> Sure, lxml's native parsing is also quite competent, sometimes better than
> BS.  But that's another topic.  I'm more concerned about something like:
> page = blah blah blah
> link = page.find('a')[0].href
> blah blah blah
> Regardless of what you parse with, if something changes then that code will
> fail.  So errorcontext makes it clearer what happened when something like
> that fails, by showing the retrieved page whenever there is an exception.

In that case, couldn't the page be logged for debugging and then use
lxml with good exception handling?

bx

Jul 24, 2009, 12:22:57 PM
to fifty-sta...@googlegroups.com
Would people find it helpful if we kept an up-to-date list of all the
python libraries that are being used by each state? I was considering
hacking one together but I didn't think too many people were trying to
run _all_ of the scripts.

-Rebecca

Adam Nelson

Jul 24, 2009, 12:26:26 PM
to fifty-sta...@googlegroups.com
I was thinking that myself. I like how Pinax does it:

http://github.com/pinax/pinax/blob/fa4ff2e2a464e3039f5d42c0bfdcfdb88d223743/requirements/external_apps.txt

I don't think it's sensible to lock version numbers unless necessary,
but otherwise, that's a good format.

Ian Bicking

Jul 24, 2009, 3:39:41 PM
to fifty-sta...@googlegroups.com
On Fri, Jul 24, 2009 at 9:22 AM, bx <ms.sh...@gmail.com> wrote:
> Would people find it helpful if we kept an up-to-date list of all the
> python libraries that are being used by each state?  I was considering
> hacking one together but I didn't think too many people were trying to
> run _all_ of the scripts.

I think it would be quite valuable to make it easier to run all the scripts together, and figuring out the Python libraries necessary would of course be a good starting point.  Adam's suggestion of a requirement file works.  Then we could create a script that sets up the environment.

I wonder if another script would be useful to run everything, and report on what each state found.

I imagine you'd do something like:

  python scripts/pyutil/runall.py

and it would run everything, maybe in parallel, and when it finished it would tell you which scripts failed and what kind of data each script created.  I can imagine some scripts finishing successfully without finding anything, so it would also help to keep track of the status of each script, because they don't all extract the same data.
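
A rough sketch of what such a runner could do, not the actual runall.py. It assumes each state lives in its own directory with a get_legislation.py, and that a working scraper prints something to stdout; run_script, run_all, and report are names invented for this example.

```python
import glob
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_script(path):
    """Run one scraper script and return (path, exit_code, output)."""
    proc = subprocess.run(
        [sys.executable, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return path, proc.returncode, proc.stdout


def run_all(root, workers=4):
    """Run every */get_legislation.py under root, several at a time."""
    scripts = sorted(glob.glob(os.path.join(root, "*", "get_legislation.py")))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_script, scripts))


def report(results):
    """Summarize failures and scripts that ran but produced no output."""
    lines = []
    for path, code, output in results:
        if code != 0:
            status = "FAILED (exit %d)" % code
        elif not output.strip():
            status = "ok, but produced no output"
        else:
            status = "ok"
        lines.append("%s: %s" % (path, status))
    return "\n".join(lines)
```

Calling print(report(run_all("."))) from the repository root would give a one-line status per state. Threads are enough here because the real work happens in the child processes.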

Ian Bicking

Jul 24, 2009, 7:39:48 PM
to fifty-sta...@googlegroups.com
On Fri, Jul 24, 2009 at 12:39 PM, Ian Bicking <ianbi...@gmail.com> wrote:
> On Fri, Jul 24, 2009 at 9:22 AM, bx <ms.sh...@gmail.com> wrote:
>> Would people find it helpful if we kept an up-to-date list of all the
>> python libraries that are being used by each state?  I was considering
>> hacking one together but I didn't think too many people were trying to
>> run _all_ of the scripts.
>
> I think it would be quite valuable to make it easier to run all the scripts together, and figuring out the Python libraries necessary would of course be a good starting point.  Adam's suggestion of a requirement file works.  Then we could create a script that sets up the environment.
>
> I wonder if another script would be useful to run everything, and report on what each state found.

Seemed reasonable so I gave it a go:


Not tested much at the moment, because I haven't had enough bandwidth to really test any working scraper.

Michael Stephens

Jul 24, 2009, 7:51:21 PM
to fifty-sta...@googlegroups.com
Someone at the hackathon (possibly Ian) suggested setting up buildbot
or a similar tool to run all of the scrapers and track which ones are
failing/succeeding. I might look into putting that together this
weekend and make it publicly available so people who are curious don't
have to try running all of the scrapers on their local machine.

-- Michael Stephens

James Turner

Jul 26, 2009, 5:16:17 PM
to fifty-sta...@googlegroups.com
From the perspective of a new guy here, but one with a ton of software engineering experience:

  1. You not only want to lock version levels of dependent libraries, you want to include them in a third-party directory (or the like).  There are three reasons:
    • If someone (such as me) is unlucky enough to do a git clone right before heading off into the deep dark woods of Northern NH, where cell phone service and WiFi are non-existent, they're going to suddenly find themselves unable to do development unless the dependent libraries have been included in the archive.
    • The last thing you want is two different state modules, each one dependent on a different version of the same library.
    • Changing to a different version of a dependent library should be an intentional act, in the same way that changing an API is.  At a minimum, you need to test the new library version with regression checking against the existing code to make sure you didn't break anything by moving to a later version.  Which brings me to...
  2. IMHO, we need to start getting unit and microtests into this code, because I keep seeing discussions of things that used to work but don't now because the APIs got changed out from under them.  If we had microtests and unit-tests with mock data, that stuff could be caught by running the regression testing that you usually do before committing.
Test driven development can seem like a real pain in the ass (I loathe doing it), but if we want this code to last, we need to have a methodology in place that makes the code resilient.

That being said, I'm more than happy to set up the infrastructure (as far as a third-party directory, the appropriate relative links to load libraries into the state modules, and making a suggestion for how to set up testing).



James Turner
Correspondent for the Christian Science Monitor
Contributing Editor, O'Reilly Media

tur...@blackbear.biz
603-513-2383
Sent from Derry, New Hampshire, United States

Ian Bicking

Jul 27, 2009, 1:13:25 PM
to fifty-sta...@googlegroups.com
On Sun, Jul 26, 2009 at 4:16 PM, James Turner <tur...@blackbear.biz> wrote:
> From the perspective of a new guy here, but one with a ton of software engineering experience:
>
>   1. You not only want to lock version levels of dependent libraries, you want to include them in a third-party directory (or the like).  There are three reasons:
>     • If someone (such as me) is unlucky enough to do a git clone right before heading off into the deep dark woods of Northern NH, where cell phone service and WiFi are non-existent, they're going to suddenly find themselves unable to do development unless the dependent libraries have been included in the archive.
>     • The last thing you want is two different state modules, each one dependent on a different version of the same library.
>     • Changing to a different version of a dependent library should be an intentional act, in the same way that changing an API is.  At a minimum, you need to test the new library version with regression checking against the existing code to make sure you didn't break anything by moving to a later version.  Which brings me to...

The requirements file that was mentioned earlier specifies all the libraries, and is pretty much as good as a download link.  Given that file you can also do:

  pip install -E my-env -r requirement-file.txt

That will install everything in a localized way.  You can also do:

  pip bundle 50states.pybundle -r requirement-file.txt

and it'll create a file (zip file, really) that contains all the source, which pip can install later, or which is relatively self-explanatory if you unzip it (and you can install with python setup.py install).  Of course not all of the scripts are Python, but... well, who knows what to do with the others.  I think they are all Ruby?  Each such script could handle its own path management, check versions, include any libraries directly, etc.

>   2. IMHO, we need to start getting unit and microtests into this code, because I keep seeing discussions of things that used to work but don't now because the APIs got changed out from under them.  If we had microtests and unit-tests with mock data, that stuff could be caught by running the regression testing that you usually do before committing.

There are two kinds of code, and I think they should probably be tested differently -- the shared code (mostly pyutils/legislation.py), and then all the implementations (*/get_legislation.X).  It looks like the non-Python implementations are going to be on their own.

Anyway, I don't think there's a lot of effective testing to be done of the implementations, except functional testing (regularly running them and keeping track of what breaks).  I do think it would be useful to keep more history about how they were supposed to work -- specifically, copies of all the pages they fetched, from before they break.  Then you could compare that with a broken page and see what the original developer was expecting and hopefully get a good idea of a fix.
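
One way to keep that history, sketched with the standard library only. The cache directory layout and the function names (snapshot_path, save_snapshot, diff_against_snapshot) are assumptions made for this example, not existing project code.

```python
import difflib
import hashlib
import os


def snapshot_path(cache_dir, url):
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".html")


def save_snapshot(cache_dir, url, page):
    """Store the page as last seen while the scraper still worked."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(snapshot_path(cache_dir, url), "w", encoding="utf-8") as f:
        f.write(page)


def diff_against_snapshot(cache_dir, url, page):
    """Return a unified diff between the stored page and the current one."""
    path = snapshot_path(cache_dir, url)
    if not os.path.exists(path):
        return None  # nothing recorded for this URL yet
    with open(path, encoding="utf-8") as f:
        old = f.read()
    diff = difflib.unified_diff(
        old.splitlines(), page.splitlines(),
        fromfile="known-good", tofile="current", lineterm="",
    )
    return "\n".join(diff)
```

A scraper would call save_snapshot after each successful run and diff_against_snapshot when parsing fails, so whoever fixes it can see what changed on the page.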

> Test driven development can seem like a real pain in the ass (I loathe doing it), but if we want this code to last, we need to have a methodology in place that makes the code resilient.

Well, what makes this different is that the code *definitely* won't last.  Maybe it's not even intended to last -- we'd all like to see it made unnecessary through policy.  But I think testing could help us roll with the punches.

James Turner

Jul 27, 2009, 1:20:15 PM
to fifty-sta...@googlegroups.com
The difference between a download link and actually having a copy of the library checked in is that it represents a snapshot of a version of the library that worked with the code that's checked in, so even if the library in the link gets changed out from under us, the code still uses the version we know works.

I think we could effectively test the state code, if we broke things up a bit.  If, instead of:

get_legislation()

we have fetch_legislation and parse_legislation (the first one fetching the raw unscraped data, and the second parsing it), then we can mock up calls to parse_legislation using known good static data stored in a file.  This allows us to do regression testing, and to determine easily if a state module broke because the web page changed, or because something in the libraries changed.
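
A small sketch of the parsing half of that split, tested against static data. The page layout, the link-collecting logic, and the KNOWN_GOOD_PAGE sample are invented for illustration; fetch_legislation is left out because it would only perform the network request.

```python
import unittest
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_legislation(html):
    """Pure function: raw HTML in, list of (href, title) pairs out."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Stand-in for a page saved to a file while the scraper was known to work.
KNOWN_GOOD_PAGE = """
<ul>
  <li><a href="/bills/hb1">HB 1</a></li>
  <li><a href="/bills/sb2">SB 2</a></li>
</ul>
"""


class ParseLegislationTest(unittest.TestCase):
    def test_known_good_page(self):
        self.assertEqual(
            parse_legislation(KNOWN_GOOD_PAGE),
            [("/bills/hb1", "HB 1"), ("/bills/sb2", "SB 2")],
        )
```

If this test still passes while the live scraper fails, the web page changed; if the test itself fails after a library upgrade, the library is the likelier culprit.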

James Turner
Correspondent for the Christian Science Monitor
Contributing Editor, O'Reilly Media

tur...@blackbear.biz
603-513-2383
Sent from North Billerica, MA, United States

Eric Mill

Jul 27, 2009, 2:16:45 PM
to fifty-sta...@googlegroups.com
I'm generally a big fan of TDD and testing in general, but I have to
lean towards Ian here and say that it doesn't seem appropriate for
code which we very much expect to break for reasons outside our
control. One middle ground might be to test the functions in
pyutils/legislation.py, but the state scrapers just aren't doing
anything that makes sense to do mocked-out testing for. Setting up a
buildbot or other CI system whose test is to actually perform the
scraping seems like the best solution.

-- Eric

Adam Nelson

Jul 27, 2009, 2:18:51 PM
to fifty-sta...@googlegroups.com
James,

I agree mostly with Ian regarding the external apps, but with you
regarding the unit testing. Your point about versions, multiple
directories, etc... is well taken, but this project doesn't have
enough manpower to monitor all the version changes in Python apps that
are going on out there - it has to be more of a bug report/fix bug
process for now. If there were a state directory that required a
different version of an app than another, I think it would be expected
that that state update their code - the Python app interfaces are
pretty stable. At some point, if there is an administrator who wants
to maintain a production branch, versions can be locked and proper
branching can happen - I don't see that happening any time soon
though.

As for unit-testing, I agree. This is a good starter guide for Django:

http://docs.djangoproject.com/en/dev/topics/testing/

Let's not forget that if the codebase is successful at the state
level, there's no reason it can't be the root of international and
local government initiatives which have a much more distant horizon.

Cheers,
Adam

Ian Bicking

Jul 27, 2009, 5:33:45 PM
to fifty-sta...@googlegroups.com
On Mon, Jul 27, 2009 at 12:20 PM, James Turner <tur...@blackbear.biz> wrote:
> The difference between a download link and actually having a copy of the library checked in is that it represents a snapshot of a version of the library that worked with the code that's checked in, so even if the library in the link gets changed out from under us, the code still uses the version we know works.

The libraries in question are all written by reasonably conscientious people who don't mess up versioning or tend to drop links.  But sure, we can also set up copies.  In some appengine projects I work on I set up a requirements file, which is just a list of packages and optionally versions, and then do:

  pip install -r requirements.txt --ignore-installed --install-option="--home=`pwd`"

This installs all the libraries in `pwd`/lib/python, and you can check them in (assuming they are portable; some things like lxml aren't).  Then I have the script (in, say, `pwd`/script) do:

  import os, site
  site.addsitedir(os.path.join(os.path.dirname(__file__), 'lib', 'python'))

One caveat: in appengine this is reliable because the system is guaranteed to be bare of other libraries, but in a normal environment if the person has installed a different version of the library globally then that version will take precedence, leading to considerable confusion.  You can make sure to give local libraries precedence, but it's a little trickier.  virtualenv solves this more elegantly in some ways, but you can't check in files from a virtualenv and then have someone check them out and run the code directly.
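
One common way to give the local libraries precedence is to reorder sys.path after the call, since site.addsitedir appends its entries at the end. A sketch, where add_local_site_dir is a name made up for this example:

```python
import os
import site
import sys


def add_local_site_dir(path):
    """Add path via site.addsitedir, then move the new entries to the front."""
    before = list(sys.path)
    site.addsitedir(path)
    # Entries that addsitedir (and any .pth files it processed) just added.
    added = [p for p in sys.path if p not in before]
    # Reorder so the checked-in copies are found before global installs.
    sys.path[:] = added + [p for p in sys.path if p not in added]
```

This only affects imports that happen afterwards; a module that was already imported from the global location stays loaded, so the call belongs at the very top of the entry script.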

If we use addsitedir, a problem is that right now there's not a single entry point for all scripts.  I think we should change that.  Then it will be like:

  ./get_legislation mn

And that one script will find mn/get_legislation.py and run it.  If it finds get_legislation.rb, then it runs it with ruby and we lose the library management and whatnot, but we're no worse off than we are now.  This is mostly what runall.py does.  And if some people want to fix up the Ruby infrastructure, that could also be added to the runner script indirectly (i.e., instead of running "ruby state/get_legislation.rb" it runs "ruby rubyutil/runner.rb state/get_legislation.rb").
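
The lookup part of such an entry point could be sketched as follows. RUNNERS and find_command are hypothetical names, and the directory layout (one directory per state containing get_legislation.py or get_legislation.rb) is taken from the description above.

```python
import os
import sys

# Interpreter for each supported script extension, tried in this order.
RUNNERS = [
    (".py", [sys.executable]),
    (".rb", ["ruby"]),
]


def find_command(root, state):
    """Return the argv list that runs the scraper for one state."""
    for ext, interpreter in RUNNERS:
        script = os.path.join(root, state, "get_legislation" + ext)
        if os.path.exists(script):
            return interpreter + [script]
    raise LookupError("no get_legislation script found for %r" % state)
```

The wrapper script would then pass find_command(".", sys.argv[1]) to subprocess.call. Swapping the Ruby entry for something like ["ruby", "rubyutil/runner.rb"] later would only mean editing RUNNERS.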

Oh, and we'd probably be packagizing the whole codebase (i.e., adding a setup.py).


> I think we could effectively test the state code, if we broke things up a bit.  If, instead of:
>
> get_legislation()
>
> we have fetch_legislation and parse_legislation (the first one fetching the raw unscraped data, and the second parsing it), then we can mock up calls to parse_legislation using known good static data stored in a file.  This allows us to do regression testing, and to determine easily if a state module broke because the web page changed, or because something in the libraries changed.

We can't separate out these two phases because the scrapers have to crawl the sites.  So you get a page, parse it, get some links, grab some more pages, and so on.  There is however some partitioning of the code through the abstract base class that all the (Python) implementations use.  I didn't find it that easy to understand, however, as it's not that well documented and there's a bunch of old methods lying around (I was trying to fix the Missouri code which used lots of deprecated APIs, but after naively toying with the code for a while I wasn't making any progress).  I'd like to see a lot more sanity checking in the shared code.  And maybe some more real objects instead of just dicts.

Anyway, this doesn't keep pyutil/legislation.py from being unittested, and I think we all agree that would be useful and the proper starting place for testing.  Well, for unit testing -- functional testing with a buildbot (or some other continuous integration tool) would also be really useful.  For the implementations, I think it would be useful to update some of them to use self.urlopen (and things like with self.soup_context(url)) and make those functions clever about storing histories and doing diffs and whatnot, which is I think how we can keep up with the continuous fixing that is required to keep them working.  There's also the possibility that a more complete package like Scrapy (http://scrapy.org/) would be useful... though looking at it just now, I'm not particularly impressed -- I think it's too complicated for the developer base of this project.

Michael Stephens

Jul 28, 2009, 12:23:15 AM
to fifty-sta...@googlegroups.com
On Mon, Jul 27, 2009 at 2:33 PM, Ian Bicking<ianbi...@gmail.com> wrote:
> On Mon, Jul 27, 2009 at 12:20 PM, James Turner <tur...@blackbear.biz> wrote:
>>
>> The difference between a download link and actually having a copy of the
>> library checked in is that it represents a snapshot of a version of the
>> library that worked with the code that's checked in, so even if the library
>> in the link gets changed out from under us, the code still uses the version
>> we know works.
>
> The libraries in question are all written by reasonably conscientious people
> who don't mess up versioning or tend to drop links.  But sure, we can also
> set up copies.  In some appengine projects I work on I set up a requirements
> file, which is just a list of packages and optionally versions, and then do:
>   pip install -r requirements.txt --ignore-installed
> --install-option="--home=`pwd`"
> This installs all the libraries in `pwd`/lib/python, and you can check them
> in (assuming they are portable; some things like lxml aren't).  Then I have
> the script (in, say, `pwd`/script) do:
>   import os, site
>   site.addsitedir(os.path.join(os.path.dirname(__file__), 'lib', 'python'))

I'm not sure if listing the combined requirements for all of the
scrapers in one place is the right way to go - this seems to be
optimizing for a use case (running all 50 of the scrapers) that should
remain uncommon. Also, some of the individual scrapers have relatively
hefty dependencies (California depends on sqlalchemy and MySQL-python)
and it doesn't make sense to suggest that someone interested in
working on another state should install these. I would rather see
separate requirements lists for each individual state, along with a
(short) list for the scraping utils.

> We can't separate out these two phases because the scrapers have to crawl
> the sites.  So you get a page, parse it, get some links, grab some more
> pages, and so on.  There is however some partitioning of the code through
> the abstract base class that all the (Python) implementations use.  I didn't
> find it that easy to understand, however, as it's not that well documented
> and there's a bunch of old methods lying around (I was trying to fix the
> Missouri code which used lots of deprecated APIs, but after naively toying
> with the code for a while I wasn't making any progress).

To explain: the API hasn't been in constant flux; rather, there was a
single large refactoring of the API about a month and a half ago which
was made with the knowledge that it would break all of the scrapers
without active maintainers. On reflection it may have been better to
remove these broken scrapers from the repository to avoid confusion
until someone got around to updating them. I've updated Missouri (as
well as WV and GA) to use the current API and will try to make sure
that all of the other scrapers get updated or removed in the next few
days. I've also removed the remnants of the old API from
pyutils/legislation.py.


> I'd like to see a
> lot more sanity checking in the shared code.

Part of the difficulty here is determining how much structure we can
require of the data we expect while remaining general enough to
support all 50 states.

-- Michael Stephens
