Library changes


Joe Germuska

Jul 29, 2009, 5:54:36 PM
to fifty-sta...@googlegroups.com
I probably should have counted to ten before sending this, but I need to register a little frustration.  Today marks the second or maybe third time where I've merged the sunlightlabs HEAD of fiftystates with my fork and found that my IL scraper doesn't run any more.

I am sure it's hard to consider all of the different volunteers when making changes to the core, especially when people come and go, etc...  but if I'm going to continue to try to keep the IL scraper working in a way which helps Fifty States, I hope we can figure out a way to keep this from happening over and over...

Thanks
Joe

-- 
Joe Germuska
J...@Germuska.com * http://blog.germuska.com    

"Participation. That's what's gonna save the human race." --Pete Seeger

Ian Bicking

Jul 29, 2009, 6:28:57 PM
to fifty-sta...@googlegroups.com
On Wed, Jul 29, 2009 at 4:54 PM, Joe Germuska<j...@germuska.com> wrote:
> I probably should have counted to ten before sending this, but I need to
> register a little frustration.  Today marks the second or maybe third time
> where I've merged the sunlightlabs HEAD of fiftystates with my fork and
> found that my IL scraper doesn't run any more.

I was thinking about having something that lets you capture all
the URLs that are fetched and then rerun in a simulation mode, where
there are no net calls. At first I couldn't figure out what the point
would be, because a replayed run won't break; breaks happen when the
upstream pages change. Now I remember: you could then quickly run
regression tests on all the scrapers whenever the libraries change.

(Though that'd only work if your IL scraper goes upstream)
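
A minimal sketch of the capture-and-replay idea, under stated assumptions: `ReplayingFetcher`, its `fetch` callable, and the example URL are all hypothetical names for illustration and are not part of fiftystates.

```python
class ReplayingFetcher:
    """Records fetched pages, then serves them back without network access."""

    def __init__(self, fetch, replay=False, recorded=None):
        self._fetch = fetch                   # hypothetical callable: url -> page body
        self.replay = replay                  # True means simulation mode
        self.recorded = dict(recorded or {})  # url -> captured page body

    def urlopen(self, url):
        if self.replay:
            # In simulation mode a miss is an error, never a network call.
            if url not in self.recorded:
                raise LookupError("no recorded response for %s" % url)
            return self.recorded[url]
        body = self._fetch(url)
        self.recorded[url] = body
        return body


calls = []

def fake_fetch(url):
    # Stand-in for a real network fetch so the sketch runs offline.
    calls.append(url)
    return "<html>page for %s</html>" % url

# First pass: capture everything the scraper asks for.
recorder = ReplayingFetcher(fake_fetch)
recorder.urlopen("http://example.com/bills")

# Second pass: rerun against the capture with no net calls.
replayer = ReplayingFetcher(fake_fetch, replay=True, recorded=recorder.recorded)
page = replayer.urlopen("http://example.com/bills")
```

Because the replay pass never touches the network, a failure during it would point at a library change rather than an upstream page change.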

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker

Michael Stephens

Jul 29, 2009, 6:44:40 PM
to fifty-sta...@googlegroups.com
On Wed, Jul 29, 2009 at 2:54 PM, Joe Germuska<j...@germuska.com> wrote:
> I probably should have counted to ten before sending this, but I need to
> register a little frustration.  Today marks the second or maybe third time
> where I've merged the sunlightlabs HEAD of fiftystates with my fork and
> found that my IL scraper doesn't run any more.

Having taken a quick look at the current code you have on GitHub, part
of the problem seems to be that you're never calling
LegislationScraper's __init__ method. The behavior of Python's super
function can be confusing, but you should be calling
super(ILLegislationScraper, self).__init__() rather than
super(LegislationScraper, self).__init__() as you currently are. The
reason that your scraper previously worked is that there was no
essential initialization code in LegislationScraper.__init__ until
recently.
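
To illustrate the difference, here is a minimal sketch. The class bodies are stand-ins, not the actual fiftystates code, and `initialized_by_base` and `BrokenScraper` are made-up names.

```python
class LegislationScraper(object):
    # Hypothetical stand-in for the real base class.
    def __init__(self):
        self.initialized_by_base = True


class BrokenScraper(LegislationScraper):
    def __init__(self):
        # Starts the method lookup *after* LegislationScraper in the MRO,
        # so this resolves to object.__init__ and the base initializer
        # never runs.
        super(LegislationScraper, self).__init__()


class ILLegislationScraper(LegislationScraper):
    def __init__(self):
        # Starts the lookup after ILLegislationScraper, so this runs
        # LegislationScraper.__init__.
        super(ILLegislationScraper, self).__init__()


print(hasattr(BrokenScraper(), "initialized_by_base"))         # False
print(hasattr(ILLegislationScraper(), "initialized_by_base"))  # True
```

The first argument to super names the class whose position in the method resolution order the search starts after, which is why it should be the class you are currently defining.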

Also, your use of util_scraper/scraper_cacher in util.py is not really
consistent with how the API is designed to be used; I'm not sure why
you're not calling urlopen() on your instance of ILLegislationScraper.
This could be made to work if there's a reason you'd like to structure
it this way.

-- Michael Stephens

Joe Germuska

Jul 29, 2009, 7:46:18 PM
to fifty-sta...@googlegroups.com
On Jul 29, 2009, at 5:44 PM, Michael Stephens wrote:
> Having taken a quick look at the current code you have on GitHub, part
> of the problem seems to be that you're never calling
> LegislationScraper's __init__ method.

Thanks for pointing that out. I can fix that. Those members could
also be initialized in the class definition, which is the temporary
fix I implemented.
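
One caveat with the class-definition approach, sketched here with hypothetical class and attribute names (the real members of LegislationScraper may differ): a mutable value set in the class body is shared by every instance, whereas one set in `__init__` is per instance.

```python
class SharedDefaults:
    bills = []  # class attribute: one list shared by all instances


class PerInstance:
    def __init__(self):
        self.bills = []  # a fresh list for each instance


a, b = SharedDefaults(), SharedDefaults()
a.bills.append("HB 1")
shared_leak = b.bills  # b sees the item appended through a

c, d = PerInstance(), PerInstance()
c.bills.append("HB 1")  # d.bills is unaffected
```

For immutable defaults such as strings or numbers, the class-level form behaves the same as setting them in `__init__`.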

> Also, your use of util_scraper/scraper_cacher in util.py is not really
> consistent with how the API is designed to be used; I'm not sure why
> you're not calling urlopen() on your instance of ILLegislationScraper.
> This could be made to work if there's a reason you'd like to structure
> it this way.


It seems that the way the API is designed to be used requires far more
code to be tied up in a subclass of LegislationScraper than I'd like.
I found it much easier to work through the vagaries of data by
breaking things down into smaller pieces and moving them into
libraries which I could import and work with in an interactive shell.
Maybe this is just my own development style...

It might be just as well to separate out the idea of a cache-savvy URL
fetcher from the scraper into its own thing which can be used
wherever. (FWIW, I'd also like it to cache PDFs, and I may try to
make it do that at some point, although I have been pulled off of
direct work on the leg scraper for the last couple of weeks.)
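
A rough sketch of what such a standalone fetcher might look like, assuming a simple on-disk cache keyed by a hash of the URL. `CachingFetcher`, the `opener` parameter, and the example URL are hypothetical and not part of fiftystates. Working in raw bytes means PDFs are cached the same way as HTML.

```python
import hashlib
import os
import tempfile
import urllib.request


class CachingFetcher:
    """Fetches URLs as raw bytes and keeps a copy on disk, keyed by URL hash."""

    def __init__(self, cache_dir, opener=None):
        self.cache_dir = cache_dir
        # opener is a hypothetical hook (url -> bytes) so the fetcher can be
        # exercised without network access.
        self._opener = opener or self._download
        os.makedirs(cache_dir, exist_ok=True)

    @staticmethod
    def _download(url):
        with urllib.request.urlopen(url) as response:
            return response.read()

    def _path_for(self, url):
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def urlopen(self, url):
        path = self._path_for(url)
        if os.path.exists(path):
            with open(path, "rb") as cached:
                return cached.read()
        data = self._opener(url)
        with open(path, "wb") as out:
            out.write(data)
        return data


downloads = []

def stub_opener(url):
    # Stand-in for a real download so the sketch runs offline.
    downloads.append(url)
    return b"%PDF-1.4 fake bytes"

fetcher = CachingFetcher(tempfile.mkdtemp(), opener=stub_opener)
first = fetcher.urlopen("http://example.com/bill.pdf")
second = fetcher.urlopen("http://example.com/bill.pdf")  # served from disk
```

Since the fetcher does not depend on any scraper class, it can be imported and tried out in an interactive shell on its own.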

Thanks for the feedback...
Joe

--
Joe Germuska
J...@Germuska.com * http://blog.germuska.com

"I felt so good I told the leader how to follow."
-- Sly Stone

Michael Stephens

Jul 29, 2009, 7:59:08 PM
to fifty-sta...@googlegroups.com
On Wed, Jul 29, 2009 at 4:46 PM, Joe Germuska<j...@germuska.com> wrote:
> It might be just as well to separate out the idea of a cache-savvy URL
> fetcher from the scraper into its own thing which can be used
> wherever.

I agree, it would be nice to separate this functionality from
LegislationScraper.

-- Michael Stephens
