I was thinking about having something that captures all the URLs that
are fetched and then reruns them in a simulation mode, where there are
no network calls. Then I couldn't figure out what the point would be,
because it would never break: breaks happen when the upstream pages
change. Now I remember: you could then run regression tests on all the
scrapers quickly when changes are made to the libraries.
(Though that'd only work if your IL scraper goes upstream)
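
A rough sketch of what that capture/replay idea could look like. The
RecordingFetcher class, its replay flag, and the injected fetch
callable are all made up for illustration; nothing like this exists in
the current code as far as this thread shows.

```python
class RecordingFetcher(object):
    """Hypothetical helper: record fetched pages, then replay them offline."""

    def __init__(self, fetch, replay=False, recorded=None):
        self.fetch = fetch      # callable: takes a URL, returns the page body
        self.replay = replay    # True means simulation mode, no network calls
        self.recorded = dict(recorded or {})

    def urlopen(self, url):
        # Anything already captured is served from the recording.
        if url in self.recorded:
            return self.recorded[url]
        # In simulation mode an unknown URL is an error, never a fetch.
        if self.replay:
            raise KeyError("no recording for %s" % url)
        body = self.fetch(url)
        self.recorded[url] = body
        return body
```

A regression run would first capture with a real fetch function, then
build a second instance with replay=True and the saved recordings, and
point the scrapers at that one.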
--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker
Having taken a quick look at the current code you have on GitHub, part
of the problem seems to be that you're never calling
LegislationScraper's __init__ method. The behavior of Python's super
function can be confusing, but you should be calling
super(ILLegislationScraper, self).__init__() rather than
super(LegislationScraper, self).__init__() as you currently are. The
reason that your scraper previously worked is that there was no
essential initialization code in LegislationScraper.__init__ until
recently.
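
To illustrate the difference with stand-in classes (the cache
attribute is invented for the example and is not taken from the real
LegislationScraper):

```python
class LegislationScraper(object):
    def __init__(self):
        # Stand-in for whatever the real base class sets up.
        self.cache = {}


class BrokenScraper(LegislationScraper):
    def __init__(self):
        # Starts the lookup *after* LegislationScraper in the MRO, so this
        # calls object.__init__ and skips LegislationScraper.__init__.
        super(LegislationScraper, self).__init__()


class FixedScraper(LegislationScraper):
    def __init__(self):
        # Name the class you are in; the lookup then starts at its parent.
        super(FixedScraper, self).__init__()
```

With these definitions, BrokenScraper() has no cache attribute at all,
while FixedScraper().cache is the empty dict set by the base class.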
Also, your use of util_scraper/scraper_cacher in util.py is not really
consistent with how the API is designed to be used; I'm not sure why
you're not calling urlopen() on your instance of ILLegislationScraper.
This could be made to work if there's a reason you'd like to structure
it this way.
-- Michael Stephens
Thanks for pointing that out. I can fix that. Those members could
also be initialized in the class definition, which is the temporary
fix I implemented.
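
One caveat with the class-definition approach, sketched with made-up
names (SharedScraper, SafeScraper, and seen are not from the real
code): a mutable value set at class level is shared by every instance,
which may or may not be what you want.

```python
class SharedScraper(object):
    seen = []    # class-level: one list shared by all instances

    def note(self, url):
        self.seen.append(url)


class SafeScraper(object):
    def __init__(self):
        self.seen = []    # instance-level: each scraper gets its own list

    def note(self, url):
        self.seen.append(url)


a, b = SharedScraper(), SharedScraper()
a.note("http://example.com/1")    # b.seen now holds this URL too

c, d = SafeScraper(), SafeScraper()
c.note("http://example.com/1")    # d.seen stays empty
```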
> Also, your use of util_scraper/scraper_cacher in util.py is not really
> consistent with how the API is designed to be used; I'm not sure why
> you're not calling urlopen() on your instance of ILLegislationScraper.
> This could be made to work if there's a reason you'd like to structure
> it this way.
It seems that the way the API is designed to be used requires far more
code to be tied up in a subclass of LegislationScraper than I'd like.
I found it much easier to work through the vagaries of data by
breaking things down into smaller pieces and moving them into
libraries which I could import and work with in an interactive shell.
Maybe this is just my own development style...
It might be just as well to separate out the idea of a cache-savvy URL
fetcher from the scraper into its own thing which can be used
wherever. (FWIW, I'd also like it to cache PDFs, and I may try to
make it do that at some point, although I have been pulled off of
direct work on the leg scraper for the last couple of weeks.)
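
Something along these lines is what I have in mind; this is only a
sketch, and CachingFetcher, cache_dir, and the injected fetch callable
are hypothetical names, not anything in the existing code. Storing raw
bytes means PDFs would be cached the same way as HTML.

```python
import hashlib
import os


class CachingFetcher(object):
    """Hypothetical standalone fetcher that keeps responses on disk."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        # fetch: callable taking a URL and returning bytes, for example
        # lambda url: urllib.request.urlopen(url).read() on Python 3.
        self.fetch = fetch
        if not os.path.isdir(cache_dir):
            os.makedirs(cache_dir)

    def _path(self, url):
        # Hash the URL so any URL maps to a safe file name.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, name)

    def urlopen(self, url):
        path = self._path(url)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        data = self.fetch(url)
        with open(path, "wb") as f:
            f.write(data)
        return data
```

Since it does not depend on any scraper class, it can be imported and
poked at from an interactive shell on its own.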
Thanks for the feedback...
Joe
--
Joe Germuska
J...@Germuska.com * http://blog.germuska.com
"I felt so good I told the leader how to follow."
-- Sly Stone
I agree, it would be nice to separate this functionality from
LegislationScraper.
-- Michael Stephens