Contributions now welcome!

bouvard

Aug 15, 2009, 12:27:43 PM
to votersdaily
Hey everyone, thanks for registering with this group! :-)

This morning I committed code for a scheduler to handle periodically
running the scrapers and also an example/tutorial scraper for people
to use as a model for their own. For those of you who have
contributed to the Fifty States Project, this is all going to look _very_
familiar. If anyone would like to take a crack at one of the
calendars listed on the sources page, I have marked the one that I
have completed so far:

http://wiki.github.com/bouvard/votersdaily/sources
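In rough outline, a new scraper just subclasses the base class and implements one method. Something like this sketch (the field names here are illustrative, not necessarily the final EventScraper interface, so check the tutorial scraper in the repo):

```python
import datetime

class EventScraper(object):
    """Sketch of the base class contract (field names are illustrative)."""
    name = None        # unique id for this scraper
    url = None         # the calendar page to scrape
    frequency = 6.0    # hours between runs, read by the scheduler

    def scrape(self):
        """Return a list of event dicts destined for the database."""
        raise NotImplementedError

class HouseSchedule(EventScraper):
    name = 'house_schedule'
    url = 'http://example.gov/schedule'   # placeholder URL
    frequency = 12.0

    def scrape(self):
        # Real code would fetch self.url and parse the HTML;
        # hard-coded here to keep the sketch self-contained.
        return [{'source': self.name,
                 'branch': 'House',
                 'datetime': datetime.datetime(2009, 8, 15, 12, 0).isoformat(),
                 'title': 'Morning session'}]
```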

If anyone decides to look at the code or contribute I would really
like to hear any feedback on what has been done so far: the database
schema, scheduler, duplicate checking methodology, or anything else.
At this early point I cannot guarantee that the base EventScraper
won't change, but hopefully we can identify anything that needs to be
tweaked sooner rather than later.

I'm hoping to write at least one more scraper this weekend, but I
haven't decided which to work on yet. I'll mark it "In Progress" on
the Sources page as soon as I decide.

Thanks again for everyone's interest!
Chris

Chauncey Thorn

Aug 15, 2009, 12:47:18 PM
to voter...@googlegroups.com
Hello everyone.

I'm using PHP to parse the files. I've cloned the votersdaily repo. I'll send you a pull request, Chris, to see if you're interested in adding it to your repo.

http://github.com/chaunceyt/votersdaily/tree/master
--
Chauncey Thorn
PHP Developer/Systems Administrator
email: chau...@gmail.com
url: http://www.cthorn.com/

Chauncey Thorn

Aug 15, 2009, 2:44:58 PM
to voter...@googlegroups.com
Can I get you to check that the .csv files contain the data you expect?

I just pushed my initial parse of http://clerk.house.gov/evs/2009/ROLL_000.asp

At the moment I'm focusing on the parsing; I'll clean up the code ASAP.

Thanks

bouvard

Aug 15, 2009, 7:52:17 PM
to votersdaily
Hey Chauncey!

Sorry, I'm just now getting back to you. I've been out this
afternoon. I'm looking at what you have committed so far and I
certainly admire your initiative! I had not anticipated someone so
quickly swooping in with another language to support, but I suppose
that's the way it is with projects like these! :-) I have to admit,
though, that I'm a bit torn, because PHP isn't really a language I
know well, and it's going to be difficult for me to support it as the
platform evolves. I'm not saying that I don't want to roll your code
in (the last thing I want is for effort to go to waste), but I do need
to really think about how I'm going to structure things so that I can
support multiple languages (Ruby here we come?) without surprising
myself with a lot of breaking changes. That said, I don't want to
leave you hanging, so here is what I have in mind:

1) I will rewrite the scheduler (run.py) to support executing
scripts from multiple languages. (Don't worry about a PHP-specific
scheduler; we need to be able to execute all of them from a single
entry point.)
2) I will pull the script metadata out into a configuration
file, such as Fifty States does with the STATUS file, so that this
information can be retrieved in a language-agnostic way.
3) If you can bring your VotersDaily_Abstract PHP class up to
feature parity with the Python EventScraper class, then I will merge
your PHP utilities into the tree under scripts/phputils.
4) Then we can merge the rest of your scrapers into their
respective directories.
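For step 1, what I have in mind is roughly to pick each script's interpreter by file extension, along these lines (a sketch only; the interpreter binary names are assumptions about the host machine):

```python
import os
import subprocess
import sys

# Interpreter lookup by file extension; extend this as languages are added.
# (The 'php' and 'ruby' binary names assume those are on the PATH.)
INTERPRETERS = {
    '.py': [sys.executable],
    '.php': ['php'],
    '.rb': ['ruby'],
}

def build_command(script_path):
    """argv that runs one scraper script, whatever language it is written in."""
    _, ext = os.path.splitext(script_path)
    return INTERPRETERS[ext] + [script_path]

def run_scraper(script_path):
    """Run a scraper as a child process; its exit code signals success/failure."""
    return subprocess.call(build_command(script_path))
```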

I'm not sure what facilities PHP has for interacting with CouchDB,
but I really want to avoid temporarily storing the data on the
filesystem (as CSV, JSON, etc.). Because many of the potential uses for
this data are semi-real-time, it doesn't really make sense to
preprocess all the data and then do a bulk import of those flat files.
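To sketch what I mean (the local CouchDB URL and database name below are placeholders): each scraper can PUT documents straight in over HTTP, and a deterministic doc id gives duplicate checking almost for free, since CouchDB answers 409 Conflict on a second PUT to an existing id without a _rev.

```python
import hashlib
import json
import urllib.request  # 2009-era code would use urllib2/httplib instead

COUCH_DB = 'http://localhost:5984/votersdaily'  # assumed local CouchDB database

def event_id(event):
    """Deterministic doc id from the identifying fields, so re-scraping the
    same page cannot insert duplicates."""
    key = '|'.join([event['source'], event['datetime'], event['title']])
    return hashlib.sha1(key.encode('utf-8')).hexdigest()

def store_event(event):
    """PUT one event document straight into CouchDB -- no flat files in between."""
    req = urllib.request.Request(
        '%s/%s' % (COUCH_DB, event_id(event)),
        data=json.dumps(event).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='PUT')
    return urllib.request.urlopen(req)
```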

Let me know what you think of this plan. I'm confident we can find
a way to get these codebases interacting happily. :-)

Thanks for your efforts,
Chris


Chauncey Thorn

Aug 15, 2009, 11:15:30 PM
to voter...@googlegroups.com
Chris,
Thanks for the reply.

I'm really just creating a PHP port of your logic, maintaining the table schema you defined and ensuring the scraping results are the same. One can store these results in CouchDB, CSV, or mysql/pgsql/sqlite.

How often do you plan to scrape?
i.e., what interval does frequency = 6.0 represent?

I really don't want to distract you from your original goal. As a solution I can turn my scraper into a webservice.

i.e. http://www.site.com/webservice/?method=scrapesite&site=HouseRollCallVotes&output=[xml|json]
You can then just make a call and it will scrape and send you the results.

Thanks.
email: chau...@gmail.com
url: http://www.cthorn.com/

bouvard

Aug 15, 2009, 11:37:46 PM
to votersdaily
Hey Chauncey,

Well, my notion is that all this data is eventually going to back
some sort of real-time site, which is to say, as current as the
government gets with its data. So my intention would be that there is
a set of scrapers--maybe 20-25 when they are all finished--which are
executed at an interval defined by the author. The frequency variable
is currently in hours. So, for instance, I can't imagine it being
useful to scrape the House Schedule more than a few times per day. It
isn't very likely to change. In fact, I will probably bump that up to
12 or 24 hours. A more frequently updated page might be scraped every
hour, or even more often, and the scheduler will handle executing the
scrapers "on interval" (which is really just a long-running process
that periodically spins off new processes).
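Sketched out, the loop amounts to something like this (the script paths and frequencies below are made up for illustration):

```python
import subprocess
import sys
import time

# (script path, frequency in hours) -- both illustrative, not the real registry
SCRAPERS = [
    ('scrapers/house_schedule.py', 12.0),
    ('scrapers/house_roll_calls.py', 1.0),
]

def due(last_run, frequency_hours, now):
    """True when at least `frequency_hours` have elapsed since the last run."""
    return now - last_run >= frequency_hours * 3600.0

def run_forever(poll_seconds=60):
    last = dict((script, 0.0) for script, _ in SCRAPERS)
    while True:
        now = time.time()
        for script, freq in SCRAPERS:
            if due(last[script], freq, now):
                # Spin off the scraper and keep scheduling; don't block on it.
                subprocess.Popen([sys.executable, script])
                last[script] = now
        time.sleep(poll_seconds)
```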

Honestly, I really would like to include the two other scrapers you
wrote (for roll call votes) and I've already done the work of
rewriting the scheduler to be language-agnostic (which I was going to
have to do eventually, anyway), so I would love it if we could get the
code together so that the scheduler can fire off your PHP scripts and
the resulting values get tossed back into CouchDB. However, I
understand that this makes more work for you and that it might be more
straightforward to just do the scripts in Python from scratch and
leverage the existing EventScraper class. I will leave that up to
you. If you want to port the rest of the "framework" to PHP, I'm more
than happy to have it in the repository and call your scripts
alongside the Python ones. (As is done with Python and Ruby in the
Fifty States project.) I just can't do the up-front work to make that
happen. (My PHP-fu is weak.)

Either way, I'm happy, but I don't think I want to go the web
services route as it adds another layer of dependency, which I'm
trying to avoid. Ideally, I want all this data constantly streaming
in off the web and all going into one big glob in the CouchDB
backend. That big table can then be exposed as an API or have an
application built right over the top. Either way, the data will all be
under one roof, which is really my ultimate goal.

In any case, don't feel as though you've distracted me. It's caused
me to get more prepared. :-)

Cheers,
Chris


Chauncey Thorn

Aug 17, 2009, 3:17:45 PM
to votersdaily
Chris,

I've managed to write scrapers for 13 of the URLs listed on your sources
page, of which I made a copy:
http://wiki.github.com/chaunceyt/votersdaily

I have the scraping results writing to what I call storageEngines:
CouchDB, iCal (*.ics), and CSV (mysql on the way).
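In Python terms (just a sketch of the idea; my actual PHP classes differ in detail), a storageEngine is essentially one write() per backend, all fed the same event list:

```python
import csv

class StorageEngine(object):
    """Sketch: every backend implements write(); scrapers call them all."""
    def write(self, events):
        raise NotImplementedError

class CSVEngine(StorageEngine):
    def __init__(self, path):
        self.path = path

    def write(self, events):
        if not events:
            return
        with open(self.path, 'w') as f:
            writer = csv.DictWriter(f, fieldnames=sorted(events[0]))
            writer.writeheader()
            writer.writerows(events)
```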

Maybe your run.py could execute my run.php?

Check out what I've done (it's a work in progress):
http://github.com/chaunceyt/votersdaily/tree/master

Thanks