HTML for California statutes: Collaborate?


Ari Hershowitz

May 31, 2011, 3:00:59 PM
to fifty-sta...@googlegroups.com, James Turk, CalWonk
Hi All,

As a side project, I've converted California's statutes into structured HTML, with most internal references now hyperlinked: see calaw.tabulaw.com.  [Self-promotion warning:] I've also written a few blog posts describing the process of adding metadata back into the CA statutes: blog.tabulaw.com.

As I've mentioned before on this list, tracking of bills or proposed legislation becomes more meaningful if it can be compared to existing legislation.

Please get in touch if you are interested in brainstorming or working with me to connect proposed legislation to the existing statutes to create a "legislative diff", or related improvements.  Even better would be thoughts on how we can get California's legislature to include this metadata in the original drafts of bills.
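
To make the "legislative diff" idea concrete, here is a minimal sketch of a word-level redline using Python's standard difflib module. The function name, the bracket markup, and the sample sentences are my own illustrations and are not taken from any California bill; a real tool would also have to handle section structure and amendment instructions.

```python
import difflib


def legislative_diff(existing: str, proposed: str) -> str:
    """Return a word-level redline of two versions of a provision.

    Deleted words are wrapped as [-...-] and inserted words as {+...+}.
    The markup is an arbitrary choice for this sketch.
    """
    old_words = existing.split()
    new_words = proposed.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.extend(old_words[i1:i2])
            continue
        if i1 < i2:  # words present only in the existing text
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j1 < j2:  # words present only in the proposed text
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)


if __name__ == "__main__":
    # Invented example sentences, not real statutory text.
    print(legislative_diff(
        "The fee shall be ten dollars per year.",
        "The fee shall be twenty dollars per year.",
    ))
```

Comparing whole words rather than characters keeps the redline readable, at the cost of flagging a whole word when only its punctuation changes.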


Best,

Ari




CalWonk

Jul 28, 2011, 2:39:23 PM
to Open State Project
Hey Ari!

I think I may have some exciting news. I recently spoke with CA's
Legislative Counsel, like THE Counsel (head lawyer), and she said the
California database is now accessible in XML through MySQL. She wanted
me to make sure the IT folks I work with know about it. She
directed me to this link: ftp://www.leginfo.ca.gov/pub/bill/

Let me know if this helps. She said they are getting ready to roll out
statutes and proposed bills since 1900, so this is pretty exciting news
from my end. Obviously I'm not a developer, so I wouldn't know anything
about MySQL. But hopefully this helps. She's invited me to meet with her
if I have any other questions, so let me know if I can be of any other
assistance.

-Phil

Ari Hershowitz

Aug 1, 2011, 2:19:24 PM
to fifty-sta...@googlegroups.com, CalWonk, Jason Wilson, Waldo Jaquith, grant.ve...@gmail.com
Hi Phil (and OpenStates list),

Thank you so much for the heads-up!  This seems like an unprecedented opening of data, and I think we can make some tremendous progress with opening California legislation in very little time, if we organize well.

I'd like to get together a team of people who are interested in creating a showcase site for California legislation, using this newly available data.  This will probably require some funding, too, but a relatively modest amount, given the open source work that many people have already done.  What I have in mind is a free site that shows CA legislation, has a Mac-like "time machine" for viewing point-in-time laws, shows instant redlining of new bills, and shows expert commentary and headnotes.

I'm cc'ing Grant Vergottini (CA legislative tech expert), Jason Wilson (Jones McClure) and Waldo Jaquith (State Decoded).

Let me know if you're interested and we can set up a virtual meet-up next week to discuss. 

Also, Phil, I plan to write about the ftp site on my blog (blog.tabulaw.com) -- let me know if there are any restrictions I should know about.


Best,
Ari


--
You received this message because you are subscribed to the Google Groups "Open State Project" group.
To post to this group, send email to fifty-sta...@googlegroups.com.
To unsubscribe from this group, send email to fifty-state-pro...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/fifty-state-project?hl=en.


Gregory Combs

Aug 1, 2011, 5:05:44 PM
to fifty-sta...@googlegroups.com, CalWonk, Jason Wilson, Waldo Jaquith, grant.ve...@gmail.com
I may be confusing this CA data with something else, but I think I remember Ray Kiddie mentioning that he's worked on this to some extent ... at the very least, I think he'd probably be interested in participating.

Ari Hershowitz

Aug 1, 2011, 5:29:10 PM
to fifty-sta...@googlegroups.com, CalWonk, Jason Wilson, Waldo Jaquith, grant.ve...@gmail.com
There has been a variety of CA data available before, but this is the largest trove I've seen.  I may have missed it before; the ftp site now includes SQL scripts and instructions to keep the data up-to-date.

At the suggestion of Greg Wilson (self-described js hacker), we'll make this into a hackathon, to create a new "state of the art" site for CA legislation.  Please contact me if you're interested in participating or helping to organize.  (I wrote a brief blogpost to describe the idea at blog.tabulaw.com).

Best,
Ari

amagni

Aug 1, 2011, 7:31:38 PM
to Open State Project

RE: leginfo data

The raw dumps and the scripts to load the data into a MySQL db have
been there (updated daily, weekly, yearly) for quite some time; they
are not new.

In fact you will notice the prior years data at the same location you
describe.

We've talked about this in the San Francisco Sunlight Meetup group. We
currently meet monthly, overlapping with the OpenSF Third Thursdays.

The issue with the leginfo raw dumps is that while they technically
deliver legislative documents into a MySQL db via loadable SQL scripts,
the documents are data blobs, i.e. there is no actual schema for the
data or the information inside each document.

It's just the raw documents, each loaded as a monolithic blob into a
database.
I.e. it's not a whole lot different from grabbing the same documents
right off the surface Web, where they are available at the site.

In the current blob format with no useful schema (other than to
effectively collate the documents in their entirety, such as by date),
it would be more immediately useful to load the blobs into a
repository and to apply search / IR (information retrieval) technology,
which can come out of the box with a choice of JSON or XML
format for the returned result sets.
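
As a rough illustration of the search / IR approach, here is a minimal sketch of an inverted index over document blobs that returns its hits as JSON. It uses only the Python standard library; the function names, document ids, and sample texts are hypothetical, and a real deployment would more likely use an existing search engine than hand-rolled code like this.

```python
import json
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids whose blob contains it.

    `documents` is a dict of {doc_id: blob_text}; both are placeholders
    for whatever the raw dump actually provides.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return a JSON array of ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return json.dumps([])
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return json.dumps(sorted(hits))


if __name__ == "__main__":
    # Invented blobs for illustration only.
    docs = {
        "doc_1": "An act to amend a section of the Public Utilities Code",
        "doc_2": "An act relating to water",
    }
    print(search(build_index(docs), "public utilities"))
```

The point of the sketch is that useful retrieval needs no relational schema at all: the blobs go in whole, and structure can be layered on later.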

In fact if you really wanted to get efficient about it and you felt
you had to have traditional RDBMS - with an actual schema - then you
could even use that as an interim step to get there without doing it
all by hand.

Check out : http://www.meetup.com/SunlightFoundation/San-Francisco-CA/308991/#initialized

or @vividsocialnet on twitter

Ari Hershowitz

Aug 1, 2011, 7:59:06 PM
to fifty-sta...@googlegroups.com
Interesting; most of the files seem to be dated from May 2011 but I must have missed them when I looked before.  Disappointing that they are not more structured, but we now have enough pieces* of the puzzle to put the structure back in.  

A number of folks have expressed an interest in a hackathon to do this, and it'd be great to discuss this with folks from the Third Thursday/Open SF meetings. Sounds like a great group.

*I built a parser that captures much of the information we'd need to make a "point-in-time" display (see calaw.tabulaw.com) with the historical data. I've also spoken with Grant Vergottini, who built the CA legislature's authoring system (and wrote on this list recently), who can provide a ton of insight into this data and how the structure can be re-inserted.


Best,
Ari

Grant Vergottini

Aug 1, 2011, 9:29:30 PM
to fifty-sta...@googlegroups.com, fifty-sta...@googlegroups.com
Hi Ari, everyone,

The database dump has been available for a little over a year now. It was in response to a request, expressed in the form of a lawsuit, by maplight.org. The data is extracted from the Oracle database early each morning. The CAML files you see as blobs come out of Oracle XDB where they are stored as shredded data. While I designed the schema for this data, it will have to be released by the California Legislative Counsel. It is their property.

I derive my XML rendition by combining the text files at leginfo.ca.gov with aspects from the database. My experience is that corrections and updates are more reliable on Leginfo if you need the daily data to be accurate. There are major changes coming with leginfo soon. I can describe this all in a phone call.

I would very much like to get involved. My hobby site is http://legix.info and my commercial site is legisweb.com. I have differencing on the legisweb.com site. Sign up for a free trial if you want to see it in action. The capability is still available, without bill tracking, after trial expiration.

I am out of the office this week. I am checking email though.

Regards,
  Grant


Sent from my iPhone

Ari Hershowitz

Aug 1, 2011, 10:13:20 PM
to fifty-sta...@googlegroups.com
Thanks, Grant, for adding this clarity.  I have spent a lot of time exploring the leginfo.ca.gov site and did not find these files; the ftp site I was able to find led to the txt files for California's 2010 codes.

It looks like the elements for a productive hackathon are coming together, and it would be especially good if, with your guidance (and perhaps the blessing of the Leg. Counsel), we can build structured data elements to work with, as a starting point for the hackathon.

To avoid (further) clogging the fifty-state list, I will send further discussion on this thread to those who have shown interest; if you haven't yet, do let me know you want to be included.

Ari

Dave G

Aug 20, 2011, 4:32:18 PM
to Open State Project, a...@tabulaw.com, grant.ve...@gmail.com
Hey all,

A while back I discussed my project to try and parse the legislative
*analyses* (done by staffers) available through LegInfo (e.g.,
http://leginfo.ca.gov/pub/11-12/bill/asm/ab_0101-0150/ab_136_cfa_20110331_163616_asm_comm.html
) for more meaningful, less basic data.

The most obvious (and technically most feasible) starting point was
the "Registered Support/Opposition" at the bottom of these docs. I've
got a decent (if messy) scraper that collects most of this and spits
it out into a CSV. It needs work in the cleanup dept, and the scraping
still doesn't work on every formatting variation of the analyses (they
vary in structure by committee).
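
For anyone who wants a feel for the approach, here is a minimal sketch of the extraction step, written against the standard library rather than my actual Scrapy code. It assumes the analysis has already been reduced to plain text, that the section starts with a "REGISTERED SUPPORT" heading, that "Support" and "Opposition" appear as their own lines, and that each organization sits on one line. Those layout assumptions, the "Analysis Prepared by" stop marker, and the sample names in the demo are all illustrative, and real analyses will break them in places.

```python
import csv
import io


def parse_positions(analysis_text):
    """Return (position, organization) pairs from one plain-text analysis.

    Assumes a simplified layout; see the caveats in the text above.
    """
    rows = []
    in_section = False
    position = None
    for raw_line in analysis_text.splitlines():
        line = raw_line.strip()
        upper = line.upper()
        if upper.startswith("REGISTERED SUPPORT"):
            in_section = True
            continue
        if not in_section or not line:
            continue
        if upper.rstrip(":") == "SUPPORT":
            position = "support"
        elif upper.rstrip(":") == "OPPOSITION":
            position = "opposition"
        elif upper.startswith("ANALYSIS PREPARED BY"):
            break  # assumed end-of-section marker
        elif position and upper != "NONE ON FILE":
            rows.append((position, line))
    return rows


def to_csv(rows):
    """Serialize the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["position", "organization"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented sample text; the organization names are placeholders.
    sample = (
        "REGISTERED SUPPORT / OPPOSITION:\n\n"
        "Support\nExample Consumers Union\nExample Nurses Association\n\n"
        "Opposition\nNone on file\n\n"
        "Analysis Prepared by: A. Staffer\n"
    )
    print(to_csv(parse_positions(sample)))
```

Keeping the parsing separate from the CSV output makes it easier to add per-committee variants of `parse_positions` without touching the export.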

The idea of an end-product would be a "genealogy of influence": a
mapping of interconnections among interest groups which would be very
interesting to analyze.

The analyses provide a great deal more in terms of substance (fiscal
estimates, staff comments, summaries of existing law, etc.) that -- if
structured -- could be seriously useful in terms of being able to sort
bills by related values along these dimensions. One could fathom
easily sorting all Health & Human Services bills by projected fiscal
impact, high-to-low.

Ari -- your posted link just seems to hit a Django starter page, but
it'd be really neat if your work could be used to provide an API that
could take as an argument a plaintext statutory reference and return a
link to that specified section (then it'd be relatively easy to mark
up docs to use that link whenever they reference a statute).
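
Roughly what I have in mind, as a minimal sketch: a function that takes a plaintext reference and returns a link. The base URL, the query parameter names, and the small table of code abbreviations are all placeholders I made up for illustration; they are not the calaw.tabulaw.com or LegInfo URL scheme, and only two phrasings of a reference are handled.

```python
import re
from urllib.parse import urlencode

# Placeholder endpoint and lookup table, assumed for this sketch only.
BASE_URL = "https://example.org/statutes"
CODE_ABBREVIATIONS = {
    "penal code": "PEN",
    "vehicle code": "VEH",
    "health and safety code": "HSC",
}

_NAMES = "|".join(re.escape(name) for name in CODE_ABBREVIATIONS)
_SECTION = r"(?P<section>\d+(?:\.\d+)?)"
_PATTERNS = [
    # "Section 187 of the Penal Code"
    re.compile(rf"Section\s+{_SECTION}\s+of\s+the\s+(?P<code>{_NAMES})", re.IGNORECASE),
    # "Penal Code Section 187"
    re.compile(rf"(?P<code>{_NAMES})\s+Section\s+{_SECTION}", re.IGNORECASE),
]


def link_for(reference):
    """Return a link for the first recognized statutory reference, or None."""
    for pattern in _PATTERNS:
        match = pattern.search(reference)
        if match:
            query = urlencode({
                "code": CODE_ABBREVIATIONS[match.group("code").lower()],
                "section": match.group("section"),
            })
            return f"{BASE_URL}?{query}"
    return None


if __name__ == "__main__":
    print(link_for("as defined in Section 187 of the Penal Code"))
```

Returning None for anything unrecognized lets a markup pass leave unknown references untouched instead of linking them wrongly.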

Anyway; I'm out on the East Coast now but happy to collaborate. My
current work uses Scrapy, and uses the metadata in the analyses (Grant,
perhaps this is your work?), so I can see it not being terribly
difficult to integrate.

Also feel free to drop a line off-list. daguar [{at]} gmai|

Dave




Ari Hershowitz

Aug 22, 2011, 11:57:16 AM
to Dave G, Open State Project, grant.ve...@gmail.com
Hi Dave,

I'll also write to you separately, but thought it'd be good to update the group on progress toward the hackathon.

It would be great to add legislative analysis to a site of California legislation as you suggest.  The initial goal of the hackathon is modest, focusing on the California Codes, with the hope of building a foundation for a wide range of data.  My CA Codes site was down due to an AWS hardware problem; I've now migrated and you should now be able to see it live: calaw.tabulaw.com. The UI code and parsers are at github here.

The hackathon is picking up steam and we welcome others' ideas and input:

* A number of organizations have offered sponsorship, thanks to Philip Ung's excellent work
* San Francisco will be the physical base of the hackathon, with tools to participate remotely
* The weekend of September 17-18 is now the tentative date, rather than Labor Day weekend.
* To get hackers started, we are building a package of baseline data, plus tools to access a California Laws API, in a Ruby gem, a Python package, or both.
* Planning and details are being added here: http://code.google.com/p/calaw/w/list
* Let me know if you would like to be added to the Google code site to add your ideas/suggestions to the wiki


Best,
Ari

-- 
Ari Hershowitz
Tabulaw, Inc.
555 Mission St., 34th Floor
San Francisco, CA 94105