Hey all,
A while back I discussed my project to parse the legislative
*analyses* (done by staffers) available through LegInfo (e.g.,
http://leginfo.ca.gov/pub/11-12/bill/asm/ab_0101-0150/ab_136_cfa_20110331_163616_asm_comm.html)
for more meaningful, less basic data.
The most obvious (and technically most feasible) starting point was
the "Registered Support/Opposition" at the bottom of these docs. I've
got a decent (if messy) scraper that collects most of this and spits
it out into a CSV. It needs work in the cleanup dept, and the scraping
still doesn't work on every formatting variation of the analyses (they
vary in structure by committee).
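
For anyone curious what that extraction step looks like, here is a rough sketch in plain Python (not the actual Scrapy code). It assumes the section starts with a "REGISTERED SUPPORT / OPPOSITION" header, lists groups under "Support" and "Opposition" subheads, and ends at an "Analysis Prepared by" line. The sample text, the function names, and the CSV columns are made up for illustration, and real analyses vary by committee.

```python
import csv
import io
import re

# Hypothetical excerpt; real analyses vary in layout by committee.
SAMPLE = """\
REGISTERED SUPPORT / OPPOSITION :

Support

Example Health Alliance
Example County Board of Supervisors

Opposition

Example Taxpayers Group

Analysis Prepared by : (name omitted)
"""

HEADER = re.compile(r"REGISTERED\s+SUPPORT\s*/\s*OPPOSITION", re.IGNORECASE)
FOOTER = re.compile(r"^\s*Analysis Prepared by", re.IGNORECASE)


def extract_positions(text):
    """Return (position, organization) pairs found after the section header."""
    match = HEADER.search(text)
    if match is None:
        return []
    pairs = []
    position = None
    for line in text[match.end():].splitlines():
        if FOOTER.match(line):
            break  # assumed end of the support/opposition section
        name = line.strip(" :\t")
        if not name:
            continue
        if name.lower() in ("support", "opposition"):
            position = name.lower()
        elif position is not None:
            pairs.append((position, name))
    return pairs


def to_csv(bill_id, pairs):
    """Serialize the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["bill", "position", "organization"])
    for position, name in pairs:
        writer.writerow([bill_id, position, name])
    return buffer.getvalue()


print(to_csv("AB 136", extract_positions(SAMPLE)))
```

Each committee's formatting variation would probably need its own header and footer patterns, which is where most of the cleanup work would go.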
The idea of an end-product would be a "genealogy of influence": a
mapping of interconnections among interest groups which would be very
interesting to analyze.
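
One possible way to build that mapping, sketched under assumptions: treat each (bill, position, organization) row as input, and count how often two groups land on the same side of the same bill. The row shape and the group names below are hypothetical, and co-occurrence counting is only one of several ways to define an "interconnection".

```python
from collections import Counter
from itertools import combinations

# Hypothetical rows in the shape (bill, position, organization).
ROWS = [
    ("AB 1", "support", "Group A"),
    ("AB 1", "support", "Group B"),
    ("AB 1", "opposition", "Group C"),
    ("AB 2", "support", "Group A"),
    ("AB 2", "support", "Group B"),
    ("AB 2", "support", "Group C"),
]


def co_occurrence(rows):
    """Count how often each pair of groups takes the same position on a bill."""
    sides = {}
    for bill, position, org in rows:
        sides.setdefault((bill, position), set()).add(org)
    edges = Counter()
    for orgs in sides.values():
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(orgs), 2):
            edges[pair] += 1
    return edges


edges = co_occurrence(ROWS)
for (a, b), weight in edges.most_common():
    print(a, b, weight)
```

The resulting weighted pairs could be loaded into any graph tool as an edge list.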
The analyses provide a great deal more in terms of substance (fiscal
estimates, staff comments, summaries of existing law, etc.) that, if
structured, could be seriously useful for sorting bills by related
values along these dimensions. One could imagine easily sorting all
Health & Human Services bills by projected fiscal impact, high-to-low.
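
Once the data is structured, that kind of query is nearly a one-liner. A small sketch, assuming each bill has already been reduced to a record with a policy area and a numeric fiscal estimate; the field names and figures are invented for illustration.

```python
# Hypothetical structured records; the figures are made up for illustration.
BILLS = [
    {"bill": "AB 1", "area": "Health & Human Services", "fiscal_impact": 250_000},
    {"bill": "AB 2", "area": "Education", "fiscal_impact": 9_000_000},
    {"bill": "AB 3", "area": "Health & Human Services", "fiscal_impact": 4_000_000},
    {"bill": "AB 4", "area": "Health & Human Services", "fiscal_impact": None},
]


def by_fiscal_impact(bills, area):
    """Bills in one policy area with a known estimate, highest first."""
    known = [
        b for b in bills
        if b["area"] == area and b["fiscal_impact"] is not None
    ]
    return sorted(known, key=lambda b: b["fiscal_impact"], reverse=True)


for record in by_fiscal_impact(BILLS, "Health & Human Services"):
    print(record["bill"], record["fiscal_impact"])
```

The hard part is getting a reliable number out of the free-text fiscal section, not the sorting.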
Ari -- your posted link just seems to hit a Django starter page, but
it'd be really neat if your work could be used to provide an API that
could take as an argument a plaintext statutory reference and return a
link to that specified section (then it'd be relatively easy to mark
up docs to link to a statute wherever they reference one).
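
To make the idea concrete, here is a minimal sketch of the lookup such an API might do. Everything specific in it is an assumption: the base URL is a placeholder, the code-name table is a tiny hypothetical mapping, and the pattern only handles references written as "<Name> Code Section <number>" passed in on their own.

```python
import re
from urllib.parse import urlencode

# Hypothetical endpoint and code-name table, for illustration only.
BASE_URL = "https://example.org/statute"
CODES = {
    "health and safety code": "HSC",
    "penal code": "PEN",
    "government code": "GOV",
}

CITATION = re.compile(
    r"(?P<code>[A-Za-z ]+? Code),? (?:Section|Sec\.|§) ?(?P<section>\d+(?:\.\d+)*)",
    re.IGNORECASE,
)


def resolve(reference):
    """Return a link for a plaintext statutory reference, or None if unrecognized."""
    match = CITATION.fullmatch(reference.strip())
    if match is None:
        return None
    # Normalize case and whitespace before looking up the code name.
    code = CODES.get(" ".join(match.group("code").lower().split()))
    if code is None:
        return None
    query = urlencode({"code": code, "section": match.group("section")})
    return BASE_URL + "?" + query


print(resolve("Health and Safety Code Section 1367.01"))
print(resolve("Penal Code, Section 187"))
```

Marking up a whole document would additionally need a pass that finds references inside running text, which is a harder matching problem than resolving one isolated reference.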
Anyway, I'm out on the East Coast now but happy to collaborate. My
current work uses Scrapy, and uses the metadata in the analyses (Grant,
perhaps this is your work?), so I can see it not being terribly
difficult to integrate.
Also feel free to drop a line off-list. daguar [{at]} gmai|
Dave
On Aug 1, 7:13 pm, Ari Hershowitz <a...@tabulaw.com> wrote:
> Thanks, Grant, for adding this clarity. I have spent a lot of time
> exploring the leginfo.ca.gov site and did not find these files; the ftp site
> I was able to find led to the txt files for California's 2010 codes.
>
> It looks like the elements for a productive hackathon are coming together,
> and it would be especially good if, with your guidance (and perhaps the
> blessing of the Leg. Counsel), we can build structured data elements to work
> with, as a starting point for the hackathon.
>
> To avoid (further) clogging the fifty-state list, I will send further
> discussion on this thread to those who have shown interest; if you haven't
> yet, do let me know you want to be included.
>
> Ari
>
> On Mon, Aug 1, 2011 at 6:29 PM, Grant Vergottini <grant.vergott...@gmail.com> wrote:
>
> > Hi Ari, everyone,
>
> > The database dump has been available for a little over a year now. It was
> > in response to a request, expressed in the form of a lawsuit, by
> > maplight.org. The data is extracted from the Oracle database early each
> > morning. The CAML files you see as blobs come out of Oracle XDB where they
> > are stored as shredded data. While I designed the schema for this data, it
> > will have to be released by The California Legislative Counsel. It is their
> > property.
>
> > I derive my XML rendition by combining the text files at leginfo.ca.gov
> > with aspects from the database. My experience is that corrections and
> > updates are more reliable on Leginfo if you need the daily data to be
> > accurate. There are major changes coming with leginfo soon. I can describe
> > this all in a phone call.
>
> > I would very much like to get involved. My hobby site is http://legix.info
> > and my commercial site is legisweb.com. I have differencing on the
> > legisweb.com site. Sign up for a
> > free trial if you want to see it in action. The capability is still
> > available, without bill tracking, after trial expiration.
>
> > I am out of the office this week. I am checking email though.
>
> > Regards,
> > Grant
>
> > Sent from my iPhone
>
> > On Aug 1, 2011, at 4:59 PM, Ari Hershowitz <a...@tabulaw.com> wrote:
>
> > Interesting; most of the files seem to be dated from May 2011 but I must
> > have missed them when I looked before. Disappointing that they are not more
> > structured, but we now have enough pieces* of the puzzle to put the
> > structure back in.
>
> > A number of folks have expressed an interest in a hackathon to do this, and
> > it'd be great to discuss this with folks from the Third Thursday/Open SF
> > meetings. Sounds like a great group.
>
> > *I built a parser that captures much of the information we'd need to make a
> > "point-in-time" display (see http://calaw.tabulaw.com)
> > with the historical data. I've also spoken with Grant Vergottini, who built
> > the CA legislature's authoring system (and wrote on this list recently), who
> > can provide a ton of insight into this data and how the structure can be
> > re-inserted.
>
> > Best,
> > Ari
>
> >> http://www.meetup.com/SunlightFoundation/San-Francisco-CA/308991/#ini...
>
> >> or @vividsocialnet on twitter
>
> >> On Aug 1, 2:29 pm, Ari Hershowitz <a...@tabulaw.com> wrote:
> >> > There has been a variety of CA data available before, but this is the
> >> > largest trove I've seen. I may have missed it before; the ftp site now
> >> > includes SQL scripts and instructions to keep the data up-to-date.
>
> >> > At the suggestion of Greg Wilson (self-described js hacker), we'll make this
> >> > into a hackathon, to create a new "state of the art" site for CA
> >> > legislation. Please contact me if you're interested in participating or
> >> > helping to organize. (I wrote a brief blogpost to describe the idea at
> >> > http://blog.tabulaw.com).