Ian,
I think it would be best to use a standard DOM tool like lxml:
http://codespeak.net/lxml/elementsoup.html
Their ElementSoup (which is really just a front end for BeautifulSoup)
can handle total garbage. I believe BeautifulSoup is already a
requirement - which should be fine for any scraping program as it's
the most effective way to deal with HTML pages.
In that case, couldn't we log the page for debugging and then use
lxml with good exception handling?
-Rebecca
Would people find it helpful if we kept an up-to-date list of all the
python libraries that are being used by each state? I was considering
hacking one together but I didn't think too many people were trying to
run _all_ of the scripts.
On Fri, Jul 24, 2009 at 9:22 AM, bx <ms.sh...@gmail.com> wrote:
> Would people find it helpful if we kept an up-to-date list of all the
> python libraries that are being used by each state? I was considering
> hacking one together but I didn't think too many people were trying to
> run _all_ of the scripts.
I think it would be quite valuable to make it easier to run all the scripts together, and figuring out the Python libraries necessary would of course be a good starting point. Adam's suggestion of a requirements file works. Then we could create a script that sets up the environment. I wonder if another script would be useful to run everything and report on what each state found.
From the perspective of a new guy here, but one with a ton of software engineering experience:
- You not only want to lock version levels of dependent libraries, you want to include them in a third-party directory (or the like). The reason is two-fold.
- If someone (such as me) is unlucky enough to do a git clone right before heading off into the deep dark woods of Northern NH, where cell service and WiFi are non-existent, they're going to suddenly find themselves unable to do development unless the dependent libraries have been included in the archive.
- The last thing you want is two different state modules, each one dependent on a different version of the same library.
- Changing to a different version of a dependent library should be an intentional act, in the same way that changing an API is. At a minimum, you need to test the new library version with regression checking against the existing code to make sure you didn't break anything by moving to a later version. Which brings me to...
- IMHO, we need to start getting unit tests and microtests into this code, because I keep seeing discussions of things that used to work but don't now because the APIs got changed out from under them. If we had microtests and unit tests with mock data, that stuff could be caught by running the regression tests that you usually run before committing.
Test driven development can seem like a real pain in the ass (I loathe doing it), but if we want this code to last, we need to have a methodology in place that makes the code resilient.
The difference between a download link and actually having a copy of the library checked in is that it represents a snapshot of a version of the library that worked with the code that's checked in, so even if the library in the link gets changed out from under us, the code still uses the version we know works.
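One way to make the code prefer the checked-in copies, sketched in Python. The directory name `third_party` and the helper name `use_vendored_libraries` are assumptions for illustration, not something that exists in the repository.

```python
import os
import sys


def use_vendored_libraries(repo_root, dirname="third_party"):
    """Put repo_root/dirname at the front of sys.path.

    Modules checked in there are then found before any other installed
    copy (for modules that have not been imported yet), so the code keeps
    using the snapshot it was tested against.
    Returns the path that was added, or None if the directory is missing.
    """
    vendored = os.path.abspath(os.path.join(repo_root, dirname))
    if not os.path.isdir(vendored):
        return None
    # Avoid duplicate entries if this is called more than once.
    if vendored in sys.path:
        sys.path.remove(vendored)
    sys.path.insert(0, vendored)
    return vendored
```

A scraper's entry point would call this once, before importing any of the dependent libraries.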
I think we could effectively test the state code if we broke things up a bit. If, instead of get_legislation(), we have fetch_legislation and parse_legislation (the first one fetching the raw unscraped data, and the second parsing it), then we can mock up calls to parse_legislation using known good static data stored in a file. This allows us to do regression testing, and to determine easily if a state module broke because the web page changed, or because something in the libraries changed.
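A minimal sketch of that split, in Python. Only the names fetch_legislation and parse_legislation come from the suggestion above; the HTML layout, the `bill` class, and the sample page are made up for illustration, and a real state page would need its own parsing.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_legislation(url):
    """Fetch the raw, unscraped page. No parsing happens here."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class _BillParser(HTMLParser):
    """Collects the text of every <li class="bill"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.bills = []
        self._in_bill = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "bill") in attrs:
            self._in_bill = True
            self.bills.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_bill = False

    def handle_data(self, data):
        if self._in_bill:
            self.bills[-1] += data


def parse_legislation(html):
    """Parse already-fetched HTML. Needs no network, so it can be tested offline."""
    parser = _BillParser()
    parser.feed(html)
    parser.close()
    return [bill.strip() for bill in parser.bills]


# Known-good static data. In practice this would be read from a file
# checked in next to the test.
SAMPLE_PAGE = """
<ul>
  <li class="bill">HB 1: An act relating to widgets</li>
  <li class="nav">Next page</li>
  <li class="bill">SB 2: An act relating to gadgets</li>
</ul>
"""


def test_parse_legislation():
    assert parse_legislation(SAMPLE_PAGE) == [
        "HB 1: An act relating to widgets",
        "SB 2: An act relating to gadgets",
    ]
```

If this test fails, something in our code or the libraries changed. If it passes while the live scraper breaks, the web page changed.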
I'm not sure if listing the combined requirements for all of the
scrapers in one place is the right way to go - this seems to be
optimizing for a use case (running all 50 of the scrapers) that should
remain uncommon. Also, some of the individual scrapers have relatively
hefty dependencies (California depends on sqlalchemy and MySQL-python)
and it doesn't make sense to suggest that someone interested in
working on another state should install these. I would rather see
separate requirements lists for each individual state, along with a
(short) list for the scraping utils.
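A sketch of how separate lists could be combined on demand, in Python. The layout (one `requirements.txt` in the shared utils directory and one in each state's directory) and the function names are assumptions for illustration.

```python
import os


def read_requirements(path):
    """Return requirement lines from a file, skipping blanks and comments."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        lines = (line.strip() for line in f)
        return [line for line in lines if line and not line.startswith("#")]


def requirements_for(root, state, utils_dir="pyutils"):
    """Shared scraping-utils requirements plus those of a single state."""
    combined = read_requirements(os.path.join(root, utils_dir, "requirements.txt"))
    for req in read_requirements(os.path.join(root, state, "requirements.txt")):
        if req not in combined:
            combined.append(req)
    return combined
```

Someone working on one state would then install only what `requirements_for` returns for that state, and the hefty dependencies of other states never enter the picture.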
> We can't separate out these two phases because the scrapers have to crawl
> the sites. So you get a page, parse it, get some links, grab some more
> pages, and so on. There is however some partitioning of the code through
> the abstract base class that all the (Python) implementations use. I didn't
> find it that easy to understand, however, as it's not that well documented
> and there's a bunch of old methods lying around (I was trying to fix the
> Missouri code which used lots of deprecated APIs, but after naively toying
> with the code for a while I wasn't making any progress).
To explain: the API has not been in constant flux; rather, there was a
single large refactoring of the API about a month and a half ago which
was made with the knowledge that it would break all of the scrapers
without active maintainers. On reflection it may have been better to
remove these broken scrapers from the repository to avoid confusion
until someone got around to updating them. I've updated Missouri (as
well as WV and GA) to use the current API and will try to make sure
that all of the other scrapers get updated or removed in the next few
days. I've also removed the remnants of the old API from
pyutils/legislation.py.
> I'd like to see a
> lot more sanity checking in the shared code.
Part of the difficulty here is determining how much structure we can
require of the data we expect while remaining general enough to
support all 50 states.
-- Michael Stephens