How would you like legislative data to be made available?

Rob Pierson

Oct 2, 2007, 6:35:38 PM

to openhous...@googlegroups.com

As the Library of Congress (i.e. THOMAS/LIS) completes its work converting the legislative summaries into XML, it is researching what system the new legislative database might be made available through. I'm going to be working on ensuring that non-Hill folks have access to a bill search system that is as capable as what is available to congressional staff, but for the moment I'd like to address the question of the raw data.

Rather than waiting for LoC to produce a proposal for how that legislative data should be made available, I think it makes sense for this group to preemptively offer ideas about how raw legislative data should be provided for repurposing on other websites. We should consider the needs of sites that will repurpose the data, but at the same time the database format recommended by this group must minimize the load that sites like GovTrack place on the web servers.

Questions:

(please answer other important questions even if I don't know that I should pose them) :)

* What file formats/system would you recommend? Is a complete dump of the entire database necessary?
* How could sites be made aware of changes to the system? Rather than accessing every bill record every night, is there a way that sites could access only records that had been updated (e.g. new cosponsors, bill actions, etc.)?
* Is it important that RSS feeds be made available for search terms? For example, an RSS feed for all new bills that contain the word Iraq in the text.
* What's the workaround for this need today?

Thanks everyone!

Josh Tauberer

Oct 4, 2007, 10:34:06 AM

to openhous...@googlegroups.com

Rob Pierson wrote:
> * What file formats/system would you recommend?

HTTP REST-based (i.e. GET) API for getting records, or just HTTP/FTP
access to files directly. No SOAP, web services, or whatever. Simple
simple simple.

> Is a complete dump
> of the entire database necessary?

It would be a good idea, for sure. Considering THOMAS has records for
somewhere in the ballpark of 200,000 bills (probably around 1GB of data,
based on my own database), if you want all of it, no one is going to be
happy with 200,000 uncompressed HTTP requests (esp. at their current
maximum permitted rate of one per second). If you're trying to get a
new project going, you might want the whole database.
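
A quick back-of-the-envelope calculation using only the two figures above (roughly 200,000 records, one request per second), and assuming no retries or downtime:

```python
RECORDS = 200_000          # rough THOMAS bill count quoted above
SECONDS_PER_REQUEST = 1    # current maximum permitted rate: one request per second

total_seconds = RECORDS * SECONDS_PER_REQUEST
hours = total_seconds / 3600
days = hours / 24

# prints: 55.6 hours, or about 2.3 days of continuous requests
print(f"{hours:.1f} hours, or about {days:.1f} days of continuous requests")
```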

> * How could sites be made aware of changes to the system? Rather
> than accessing every bill record every night, is there a way that
> sites could only access records that had been updated (i.e. new
> cosponsors, bill action, etc).

That's an absolute must. That's one of the biggest problems I have with
GovTrack. Not all bill updates are reflected in the Daily Digest, and
there's no other way to get a list of changed bills. (The D.D. is also
not machine readable...)

That could be done simply by updating a file with the last modified time
of each record any time a record is modified, or by making a dynamic
page that gives all modified records within a given time frame.
(Critically, these pages should at the very least cover 7 days of
changes in one request and not require paging through 1-50, 51-100, etc.
That's so annoying.) This *could* be done in RSS, which would sort of
make use of standard date formats and things, so long as it refers to
records unambiguously, and that might give it a dual use for
individuals. But, that might be unnecessary.
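
A minimal sketch of the "last modified" file idea, assuming a plain-text index with one line per record in the form `<record id><TAB><ISO 8601 timestamp>`. The file layout and the record ids are made up for illustration.

```python
from datetime import datetime, timedelta


def parse_index(text: str) -> dict[str, datetime]:
    """Parse lines of the assumed form '<record id>\t<ISO 8601 timestamp>'."""
    index = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        record_id, stamp = line.split("\t")
        index[record_id] = datetime.fromisoformat(stamp)
    return index


def changed_since(index: dict[str, datetime], cutoff: datetime) -> list[str]:
    """Return every record modified at or after the cutoff, in one pass."""
    return sorted(rid for rid, modified in index.items() if modified >= cutoff)


# Hypothetical index contents.
SAMPLE = (
    "hr1-110\t2007-10-03T14:05:00\n"
    "s200-110\t2007-09-20T09:30:00\n"
    "hr3162-110\t2007-10-01T18:45:00\n"
)

now = datetime(2007, 10, 4, 10, 0, 0)
recent = changed_since(parse_index(SAMPLE), now - timedelta(days=7))
print(recent)  # ['hr1-110', 'hr3162-110']
```

A consumer downloads one small file, gets the full seven-day window in a single request, and then fetches only the listed records.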

> * Is it important that RSS feeds be made available for search terms?
> For example, an RSS feed for all new bills that contain the word
> Iraq in the text.

This shouldn't be a point that slows down anything else. RSS feeds by
LIV terms (as I do) are a good starting place, but certainly full text
search feeds would be nice. Not sure if it's computationally/cost
realistic though.
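
A minimal sketch of a feed keyed on subject terms, which only needs a lookup rather than a full-text search. The bill records, the `terms` field, and the feed URL are hypothetical; only the RSS 2.0 element names are standard.

```python
import xml.etree.ElementTree as ET

# Made-up records; "terms" stands in for the subject terms assigned to a bill.
BILLS = [
    {"id": "hr1-110", "title": "Example bill A", "terms": ["Iraq", "Defense"]},
    {"id": "s2-110", "title": "Example bill B", "terms": ["Education"]},
]


def feed_for_term(bills: list[dict], term: str) -> str:
    """Build a minimal RSS 2.0 document listing bills tagged with `term`."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Bills tagged: {term}"
    ET.SubElement(channel, "link").text = "https://example.org/feeds"  # hypothetical
    ET.SubElement(channel, "description").text = f"New bills with subject term {term}"
    for bill in bills:
        if term in bill["terms"]:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = bill["title"]
            # A stable identifier lets readers refer to records unambiguously.
            ET.SubElement(item, "guid", isPermaLink="false").text = bill["id"]
    return ET.tostring(rss, encoding="unicode")
```

Calling `feed_for_term(BILLS, "Iraq")` yields a feed with a single item for the first record. A full-text version would replace the `term in bill["terms"]` test with a search over the bill text, which is where the cost concern comes in.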

--
- Josh Tauberer

http://razor.occams.info

"Yields falsehood when preceded by its quotation! Yields
falsehood when preceded by its quotation!" Achilles to
Tortoise (in "Gödel, Escher, Bach" by Douglas Hofstadter)

Chris Baker

Oct 4, 2007, 11:00:34 AM

to openhous...@googlegroups.com

Rob,

First off, thanks for asking.

Excuse me if this is something that has already been dealt with. I'm new to the existing open government projects and players, so I'm still trying to get myself up to speed. I'm coming from the perspective of someone focused on the local, on-the-ground aspects of the political process and trying to bridge that to what's going on in Washington.

In my perfect world representatives from government, watchdog groups and the media would form a working group to create standards for organizing legislative data so that it is easily processed using automated tools. I see the need for standardized ontologies (OWL), vocabularies (SKOS) and RDF Schemas that don't just apply to Congress, but to the entire process of government itself.

The biggest problem I keep hearing is that people don't know what's going on... that Representatives don't know the content of the bills they are passing, and that the public doesn't know where their tax dollars are being spent... there's information overload. This to me points to as much a metadata problem as a data problem.

The data coming out needs to make it as easy as possible for people, both inside and outside the government, to build tools. IMNSHO RSS is simply too vague a format. I'd like to see the data as plastic as possible, and to me that means RDF.

This raises the following questions:

* How much overlap is there between state and federal legislative activities, so that we don't have to solve the same problem over and over again?
* What standards already exist in the financial world that budget reporting could learn from?

On 10/2/07, Rob Pierson <pier...@gmail.com> wrote:
> * What file formats/system would you recommend? Is a complete dump
> of the entire database necessary?

No. Ideally I'd like to see a SPARQL interface.
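
A minimal sketch of what querying such an interface might look like, assuming the endpoint accepts queries over HTTP GET in a `query` parameter as the SPARQL protocol describes. The endpoint URL, the `bill:` prefix, and every property name are placeholders, not an existing legislative vocabulary.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and vocabulary; the prefix and property names are
# placeholders, not an existing legislative ontology.
ENDPOINT = "https://example.gov/sparql"

# Bills in the 110th Congress with at least one cosponsor from Washington state.
QUERY = """\
PREFIX bill: <https://example.gov/ns/bill#>
SELECT ?bill ?title
WHERE {
  ?bill bill:congress 110 ;
        bill:title ?title ;
        bill:cosponsor ?member .
  ?member bill:state "WA" .
}
"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a query string into a GET request URL for the endpoint."""
    return endpoint + "?" + urlencode({"query": query})


print(sparql_get_url(ENDPOINT, QUERY))
```

The appeal is that a consumer asks for exactly the slice it needs instead of downloading everything and filtering locally.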

> * How could sites be made aware of changes to the system? Rather than
> accessing every bill record every night, is there a way that sites could
> only access records that had been updated (i.e. new cosponsors, bill action, etc).

With the data annotated in RDF, it should be easy to create interfaces that allow users to subscribe to updates via RDF or RSS.

> * Is it important that RSS feeds be made available for search terms? For example,
> an RSS feed for all new bills that contain the word Iraq in the text.

Personally, I think RSS is useful for many, but it is not enough for easy use by automated tools. You shouldn't have to scrape and parse the text for keywords. The text should be annotated using defined specifications and vocabularies.

> * What's the work around for this need today?

A lot of crude munging.

Thanks again for everyone's work. You guys are really an inspiration.

Chris Baker
http://semanticcaucus.blogspot.com/

Josh Tauberer

Oct 4, 2007, 11:22:06 AM

to openhous...@googlegroups.com

Chris Baker wrote:
> I'd like to see the data as plastic as
> possible, and to me that means RDF.

Oh, so been there, done that!

Lately, because of the rate at which these open data things are
improving, and the fact that the LOC people that I talked to don't even
seem to have any interest in public open data, my take is that the best
hope for seeing progress is to suggest the simplest way to go forward.
That means XML, REST, etc.

(Have you seen these?
http://www.govtrack.us/sparql.xpd
http://www.govtrack.us/source.xpd )

(And, btw, I take friendly issue with your blog entry that the WaPo is
leading the way in 21st century democracy with their votes database.
::grin::)

Chris Baker

Oct 4, 2007, 11:53:49 AM

to openhous...@googlegroups.com

On 10/4/07, Josh Tauberer <taub...@govtrack.us> wrote:

> Chris Baker wrote:
> > I'd like to see the data as plastic as
> > possible, and to me that means RDF.
>
> Oh, so been there, done that!
>
> Lately, because of the rate at which these open data things are
> improving, and the fact that the LOC people that I talked to don't even
> seem to have any interest in public open data, my take is that the best
> hope for seeing progress is to suggest the simplest way to go forward.
> That means XML, REST, etc.

I can certainly understand this tactically. I'm more of a Utopian idealist than a step-in-the-right-direction man. I think that there will be a demand for this data no matter what, so if legislators don't offer it up, eventually outside groups will do it themselves and thus control the data feeds.

Unfortunately, it will be a hard sell until there are tools that make the case, so we're trapped in a chicken-waiting-for-the-egg-waiting-for-the-chicken holding pattern. This is why my focus is on building tools for use by people on the ground. The data set is smaller, and it's easier to apply to real world situations.

> (Have you seen these?
> http://www.govtrack.us/sparql.xpd
> http://www.govtrack.us/source.xpd )

No, and very cool!

> (And, btw, I take friendly issue with your blog entry that the WaPo is
> leading the way in 21st century democracy with their votes database.
> ::grin::)

OK... that's a fair cop 8-)

Let me quantify things. For me, one of the driving forces for technological political innovation needs to be traditional print media. As information grows out of control I see a real market for trusted non-partisan sources that can divine the semantic tea leaves. As they get more and more marginalized my hope is that they'll see this and run with it. The Washington Post is doing that.

Chris
http://semanticcaucus.blogspot.com/

Derek Willis

Oct 4, 2007, 12:17:35 PM

to Open House Project

On Oct 4, 11:53 am, "Chris Baker" <ign...@gmail.com> wrote:

> On 10/4/07, Josh Tauberer <taube...@govtrack.us> wrote:
>
> > (And, btw, I take friendly issue with your blog entry that the WaPo is
> > leading the way in 21st century democracy with their votes database.
> > ::grin::)
>
> OK... that's a fair cop 8-)
>
> Let me quantify things. For me, one of the driving forces for technological
> political innovation needs to be traditional print media. As information
> grows out of control I see a real market for trusted non-partisan sources
> that can divine the semantic tea leaves. As they get more and more
> marginalized my hope is that they'll see this and run with it. The
> Washington Post is doing that.
>
> Chris
> http://semanticcaucus.blogspot.com/

Well, we're *trying* to do that. But Josh got there first, and I hope
he knows how many folks in the media appreciate his efforts.

As to the questions raised, I'd prefer an entire database dump, or at
least sections of the database that can be regularly updated, sort of
the way the Federal Election Commission does with its data. Plenty of
other things can be extended from that, including RSS and other stuff,
so if requiring the LoC to have it delays things, then I second Josh's
recommendation. Simplicity works best.

Derek Willis
washingtonpost.com

John Brothers

Oct 5, 2007, 11:14:55 AM

to openhous...@googlegroups.com

Hello all,

   I'm fairly new to this group, but I figure I should throw my 2 cents in. I'm the CTO of the Sunlight Foundation, with long experience in open source and data processing.

   File System/Formats -

     Like Derek, I'd generally prefer everything to be in an actual database, with timestamps on records and such. That's the ultimate flexible format: it allows us (the consumers of the data) to create feeds, to sort and group the data, and to run statistics with ease.

     SPARQL looks pretty neat, if verbose and somewhat cumbersome, but I don't think it particularly buys us anything over a solid open source database infrastructure.

   Notifications

     A proper database structure would handle this easily.
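
A minimal sketch of how timestamps on records make change notification a one-line query, using an in-memory SQLite database. The table layout, column names, and sample rows are assumptions for illustration, not a proposed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bills (
        bill_id    TEXT PRIMARY KEY,
        title      TEXT NOT NULL,
        updated_at TEXT NOT NULL  -- ISO 8601, so string order is time order
    )
    """
)
# Made-up sample rows.
conn.executemany(
    "INSERT INTO bills VALUES (?, ?, ?)",
    [
        ("hr1-110", "Example bill A", "2007-10-03T14:05:00"),
        ("s2-110", "Example bill B", "2007-09-20T09:30:00"),
    ],
)


def updated_since(connection: sqlite3.Connection, cutoff: str) -> list[str]:
    """Return ids of records whose timestamp is at or after the cutoff."""
    rows = connection.execute(
        "SELECT bill_id FROM bills WHERE updated_at >= ? ORDER BY updated_at",
        (cutoff,),
    )
    return [row[0] for row in rows]


print(updated_since(conn, "2007-10-01T00:00:00"))  # ['hr1-110']
```

Feeds, alerts, and grouped statistics can all be layered on the same table by third parties.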

   RSS Feeds

     Let third parties (Josh & GovTrack, Sunlight, etc.) provide the RSS feeds, alerts, Twitter interfaces and all that other stuff. Keep it simple at the source.

   Workaround

     Sorry, I don't know the current situation, so I can't comment.

   Contingency plans

     We all get together and one group agrees to be the one that pulls from the existing data and creates the database, and the others pull from that manufactured database? Sunlight, for example, certainly has the resources to do this, if necessary.

--
CTO @ SunlightFoundation.com - 678 467 3504
Agile Development Blog: IndefiniteArticles.com
Stone Magic: Stonemagic.Picobusiness.com

james a. jacobs

Oct 5, 2007, 1:27:24 PM

to openhous...@googlegroups.com

as we think about these issues, there are two things that i think are
useful to consider. these are things i have found in over 20 years of
dealing with legacy data, legacy software, legacy formats and trying
to use data today that was created years ago:

1. (and this has been mentioned here before) "simple" gets
implemented and "complex" does not. that's an oversimplification, of
course, but we've seen examples: html being so simple that it
enabled the web, but with rdf being complex and much slower to create
the semantic web; the government's "GILS" (Government Information
Locator Service) being well-thought out, but mostly un-implemented; etc.

2. separation of data from applications is *always* better for
preservation purposes. when agencies instantiate their information in
databases it is just too easy for them to leave out data that doesn't
fit the database design and too tempting to use software-specific
functionality that gets lost in translation to any other system.

this leads me only to generic suggestions, not specific ones:

a. software-neutral and OS-neutral formats for distribution and
preservation (i.e., xml)

b. "minimal-level" rules for mark-up and metadata to help ensure that
some core information *always* gets produced and saved -- even if it
is *possible* to produce much more complex and demanding and
expensive information.

c. flexibility: standards should allow for complete, comprehensive
markup without limitations (field size, character-encoding, etc.) and
should allow for change over time as our needs change.
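
a minimal sketch of suggestion (b), assuming a "minimal-level" rule is simply a short list of elements that every record must carry. the field names here are invented for illustration; a real rule set would come from whatever standard is agreed on.

```python
import xml.etree.ElementTree as ET

# Assumed core fields; a real rule set would come from the standard itself.
REQUIRED_FIELDS = ("identifier", "title", "date")


def missing_core_fields(xml_text: str) -> list[str]:
    """List required elements that are absent or empty in a record."""
    root = ET.fromstring(xml_text)
    missing = []
    for name in REQUIRED_FIELDS:
        value = root.findtext(name)
        if value is None or not value.strip():
            missing.append(name)
    return missing
```

a record may carry any amount of extra markup and still pass; the check only guards the core that must always be produced and saved.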

James A. Jacobs
Data Services Librarian Emeritus
University of California San Diego

Michael Stern

Oct 5, 2007, 1:42:32 PM

to openhous...@googlegroups.com

You may be interested in this article on beltway blogging. David All is mentioned.

Beltway Blogroll: National Journal's Cover Story On Blogs

Mike Stern

John Wonderlich

Oct 10, 2007, 1:13:10 AM

to openhous...@googlegroups.com

To what degree should efforts to coordinate legislative metadata standards (i.e., re-envisioning THOMAS with enhanced functionality and public database access) take into account existing archival metadata standards?

I've been looking through this LOC website for the Encoded Archival Description, which looks like a set of DTDs and schemas for archiving. In what ways is standardizing legislative information formats different from standardizing archive descriptions? In what ways is THOMAS different from an archive?

I expect we could gain something from understanding the development of this project, well described here, or perhaps also be aware of the sort of standardization discussions LOC is having publicly on a listserv here.

Are there many similar specialized discussions with relevance to this issue? GODORT is the only other I'm aware of (oh, and GOVDOC-L, which is really fascinating because the conversations have been archived on a Google group since 1991, making for some interesting discussions about issues like adjusting to CDs as storage devices, or imagining how government will be affected by the then-nascent Internet).

John

--
John Wonderlich

Program Director
The Sunlight Foundation
(202) 742-1520 ext. 234

Sarah

Oct 10, 2007, 12:12:35 PM

to Open House Project

- What file formats/system would you recommend? Is a complete dump of
the entire database necessary?

-A web service with XML.

- How could sites be made aware of changes to the system? Rather than
accessing every bill record every night, is there a way that sites could
only access records that had been updated (i.e. new cosponsors, bill
action, etc).

-They don't need to be made aware of changes; they just need to
develop a system for re-checking the web service and identifying new
content. Queries don't need to be babysat: if only some records were
available at any time, there would inevitably be a time when you'd
need the old ones that were no longer available.
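
A minimal sketch of the "re-check and identify new content" approach, assuming the consumer keeps a fingerprint of each record between runs. How the records are fetched is left out, and the function and variable names are illustrative only.

```python
import hashlib


def fingerprint(record_text: str) -> str:
    """Hash a record's content so two copies can be compared cheaply."""
    return hashlib.sha256(record_text.encode("utf-8")).hexdigest()


def diff_snapshot(seen: dict[str, str], current: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare the latest fetch against stored fingerprints.

    `seen` maps record id -> fingerprint from the previous run and is updated
    in place; `current` maps record id -> freshly fetched record text.
    Returns (new ids, changed ids).
    """
    new, changed = [], []
    for record_id, text in sorted(current.items()):
        digest = fingerprint(text)
        if record_id not in seen:
            new.append(record_id)
        elif seen[record_id] != digest:
            changed.append(record_id)
        seen[record_id] = digest
    return new, changed
```

This keeps the source simple, but the consumer still has to fetch every record on every run in order to compare it.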

- Is it important that RSS feeds be made available for search terms?
For example, an RSS feed for all new bills that contain the word Iraq in the
text.

Not really--I can understand why a small slice of the population
would want it, but I don't think it's really needed by the broader
population. If you have a web
service, you have everything, and you can tease out terms,
simplifying information for others.

- What's the work around for this need today?

There are a couple of good web services created by a few states
already. They made it in house and are willing to share it freely
with others. LoC doesn't need to pay out a bunch of money for this---
it's freely available to them now, and they just need to make a few
changes to suit their needs.

Rob Pierson

Oct 11, 2007, 5:50:37 PM

to openhous...@googlegroups.com

Thanks for all of the excellent technical recommendations everyone.

I've been told that folks at LoC are following our conversations here and have been finding them quite useful. Perhaps the next step in developing a recommendation on the community's technical and functional requirements would be a conference call and then collaboration on a google doc?

I'm also looking at holding a discussion with Congressional staff about what changes they would like to see in LIS and THOMAS. I'd invite LIS and THOMAS staff to brief offices on their plans for the future and we could then discuss what features staff would like to see. I've already spoken to staffers who want to be able to display cosponsors and other bill data through some sort of official web service / API, and providing a forum for those requests could help make that a reality.

Sarah raised a great point, and I was also hoping we could point out to LoC some concrete examples of legislative databases that were implemented in a really useful way. Do any state or foreign governments have particularly good implementations of web services and/or ways of making their raw legislative database available?

sim...@gmail.com

Oct 14, 2007, 7:18:02 PM

to Open House Project

UW ITS http://www.its.washington.edu/ does a great job of serving data
for a variety of DOT systems in the Northwest. One of the cooler
applications created from the data is the now defunct Bus Monster:
http://www.busmonster.com/
