As the Library of Congress (i.e. THOMAS/LIS) completes its work converting the legislative summaries into XML, it is researching the system through which the new legislative database might be made available. I'm going to be working on ensuring that non-Hill folks have access to a bill search system that is as capable as what is available to congressional staff, but for the moment I'd like to address the question of the raw data.
Rather than waiting for LoC to produce a proposal for how that legislative data should be made available, I think it makes sense for this group to preemptively offer ideas about how raw legislative data should be provided for repurposing on other websites. We should consider the needs of sites that will repurpose the data, but at the same time the database format recommended by this group must minimize the load that sites like GovTrack place on LoC's web servers.
Questions:
(please answer other important questions even if I don't know that I should pose them) :)
- What file formats/system would you recommend? Is a complete dump of the entire database necessary?
- How could sites be made aware of changes to the system? Rather than accessing every bill record every night, is there a way that sites could only access records that had been updated (i.e. new cosponsors, bill action, etc).
- Is it important that RSS feeds be made available for search terms? For example, an RSS feed for all new bills that contain the word Iraq in the text.
Rob Pierson wrote:
> * What file formats/system would you recommend?
HTTP REST-based (i.e. GET) API for getting records, or just HTTP/FTP access to files directly. No SOAP, web services, or whatever. Simple simple simple.
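To make "simple simple simple" concrete, here is a minimal sketch of what a consumer of such a GET-based interface might look like. Everything specific in it is an assumption for illustration: the base URL, the `congress/type/number.xml` path layout, and the `<bill>`, `<title>` and `<cosponsor>` element names are all hypothetical, since no such LoC interface has been specified.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical base URL; no such LoC endpoint is known to exist.
BASE_URL = "https://example.gov/bills"


def bill_url(congress: int, bill_type: str, number: int) -> str:
    """Build the address of one bill record (path layout is assumed)."""
    return f"{BASE_URL}/{congress}/{bill_type}/{number}.xml"


def parse_bill(xml_text: str) -> dict:
    """Pull a few fields out of a bill record (element names are assumed)."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "cosponsors": [c.text for c in root.findall("cosponsor")],
    }


def fetch_bill(congress: int, bill_type: str, number: int) -> dict:
    """One plain HTTP GET per record; no SOAP envelope or client library."""
    with urllib.request.urlopen(bill_url(congress, bill_type, number)) as resp:
        return parse_bill(resp.read().decode("utf-8"))


# A made-up record, used here so the parsing can be tried without a server.
SAMPLE = """<bill id="hr1234-110">
  <title>Example Act of 2007</title>
  <cosponsor>Rep. A</cosponsor>
  <cosponsor>Rep. B</cosponsor>
</bill>"""

record = parse_bill(SAMPLE)
```

The point of the sketch is only that a predictable URL plus a plain XML file is enough for a third party to get started with nothing beyond a standard library.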
> Is a complete dump of the entire database necessary?
It would be a good idea, for sure. Considering THOMAS has records for somewhere in the ballpark of 200,000 bills (probably around 1GB of data, based on my own database), if you want all of it, no one is going to be happy with 200,000 uncompressed HTTP requests (esp. at their current maximum permitted rate of one per second). If you're trying to get a new project going, you might want the whole database.
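To put a rough number on that, here is a back-of-the-envelope calculation. It assumes exactly 200,000 records fetched at exactly one request per second, with no retries, failures or other overhead, so the real figure would likely be higher.

```python
RECORDS = 200_000          # rough number of bills, from the estimate above
SECONDS_PER_REQUEST = 1    # the one-request-per-second limit mentioned above

total_seconds = RECORDS * SECONDS_PER_REQUEST
hours = total_seconds / 3600
days = hours / 24

print(f"{hours:.1f} hours, about {days:.1f} days")
```

Under those assumptions a full crawl takes a little over two days of continuous requests, which is the case for offering a bulk download alongside per-record access.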
> * How could sites be made aware of changes to the system? Rather
> than accessing every bill record every night, is there a way that
> sites could only access records that had been updated (i.e. new
> cosponsors, bill action, etc).
That's an absolute must. That's one of the biggest problems I have with GovTrack. Not all bill updates are reflected in the Daily Digest, and there's no other way to get a list of changed bills. (The D.D. is also not machine readable...)
That could be done simply by updating a file with the last modified time of each record any time a record is modified, or by making a dynamic page that gives all modified records within a given time frame. (Critically, these pages should at the very least cover 7 days of changes in one request and not require paging through 1-50, 51-100, etc. That's so annoying.) This *could* be done in RSS, which would sort of make use of standard date formats and things, so long as it refers to records unambiguously, and that might give it a dual use for individuals. But, that might be unnecessary.
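As a sketch of the first option (a single file listing each record's last-modified time), the consumer side could be as small as the following. The file layout, one `record_id<TAB>ISO-8601 timestamp` pair per line, and the record ids are assumptions made up for this example.

```python
from datetime import datetime, timedelta, timezone


def parse_changes(text: str) -> dict:
    """Parse lines of 'record_id<TAB>ISO-8601 timestamp' into a dict."""
    changes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record_id, stamp = line.split("\t")
        changes[record_id] = datetime.fromisoformat(stamp)
    return changes


def modified_since(changes: dict, since: datetime) -> list:
    """Return ids of records modified at or after `since`, oldest first."""
    hits = [(ts, rid) for rid, ts in changes.items() if ts >= since]
    return [rid for ts, rid in sorted(hits)]


# Made-up contents of the hypothetical last-modified file.
SAMPLE = "\n".join([
    "hr1234-110\t2007-10-01T09:00:00+00:00",
    "s56-110\t2007-10-03T14:30:00+00:00",
    "hr99-110\t2007-09-20T08:00:00+00:00",
])

now = datetime(2007, 10, 4, tzinfo=timezone.utc)
# One request covers the whole 7-day window; no paging through results.
recent = modified_since(parse_changes(SAMPLE), now - timedelta(days=7))
```

A site would then re-fetch only the records named in `recent` instead of every bill every night.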
> * Is it important that RSS feeds be made available for search terms?
> For example, an RSS feed for all new bills that contain the word
> Iraq in the text.
This shouldn't be a point that slows down anything else. RSS feeds by LIV terms (as I do) are a good starting place, and full-text search feeds would certainly be nice. I'm not sure it's realistic computationally or cost-wise, though.
"Yields falsehood when preceded by its quotation! Yields falsehood when preceded by its quotation!" Achilles to Tortoise (in "Gödel, Escher, Bach" by Douglas Hofstadter)
Excuse me if this is something that has already been dealt with. I'm new to the existing open government projects and players, so I'm still trying to get myself up to speed. I'm coming from the perspective of someone focused on the local, on-the-ground aspects of the political process and trying to bridge that to what's going on in Washington.
In my perfect world representatives from government, watchdog groups and the media would form a working group to create standards for organizing legislative data so that it is easily processed using automated tools. I see the need for standardized ontologies (OWL), vocabularies (SKOS) and RDF Schemas that don't just apply to Congress, but to the entire process of government itself.
The biggest problem I keep hearing is that people don't know what's going on... that Representatives don't know the content of the bills they are passing, and that the public doesn't know where their tax dollars are being spent... there's information overload. This to me points to as much a metadata problem as a data problem.
The data coming out needs to make it as easy as possible for people, both inside and outside the government, to build tools. IMNSHO RSS is simply too vague a format. I'd like to see the data as plastic as possible, and to me that means RDF.
This raises the following questions:
* How much duplication of concerns is there between state legislative activities and federal, so that we don't have to solve the problem over and over again?
* What standards already exist in the financial world, so that budget reporting can learn from existing efforts?
On 10/2/07, Rob Pierson <piers...@gmail.com> wrote:
> * What file formats/system would you recommend? Is a complete dump
> of the entire database necessary?
No. Ideally I'd like to see a SPARQL interface.
> * How could sites be made aware of changes to the system? Rather than
> accessing every bill record every night, is there a way that sites could
> only access records that had been updated (i.e. new cosponsors, bill
> action, etc).
By annotating the data with RDF, it should be possible to easily create interfaces that would allow users to subscribe to updates via RDF or RSS.
> * Is it important that RSS feeds be made available for search terms? For
> example, an RSS feed for all new bills that contain the word Iraq in the text.
Personally, I think RSS is useful for many, but it is not enough for easy use by automated tools. You shouldn't have to scrape and parse the text for keywords. The text should be annotated using defined specifications and vocabularies.
> * What's the work around for this need today?
A lot of crude munging.
Thanks again for everyone's work. You guys are really an inspiration.
Chris Baker wrote:
> I'd like to see the data as plastic as
> possible, and to me that means RDF.
Oh, so been there, done that!
Lately, because of the rate at which these open data things are improving, and the fact that the LOC people that I talked to don't even seem to have any interest in public open data, my take is that the best hope for seeing progress is to suggest the simplest way to go forward. That means XML, REST, etc.
On 10/4/07, Josh Tauberer <taube...@govtrack.us> wrote:
> Chris Baker wrote:
> > I'd like to see the data as plastic as
> > possible, and to me that means RDF.
>
> Oh, so been there, done that!
>
> Lately, because of the rate at which these open data things are
> improving, and the fact that the LOC people that I talked to don't even
> seem to have any interest in public open data, my take is that the best
> hope for seeing progress is to suggest the simplest way to go forward.
> That means XML, REST, etc.
I can certainly understand this tactically. I'm more of a Utopian idealist than a step-in-the-right-direction man. I think that there will be a demand for this data no matter what, so if legislators don't offer it up, outside groups will eventually do it themselves and thus control the data feeds.
Unfortunately, it will be a hard sell until there are tools that make the case, so we're trapped in a chicken-waiting-for-the-egg-waiting-for-the-chicken holding pattern. This is why my focus is on building tools for use by people on the ground. The data set is smaller, and it's easier to apply to real-world situations.
> (And, btw, I take friendly issue with your blog entry that the WaPo is
> leading the way in 21st century democracy with their votes database.
> ::grin::)
OK... that's a fair cop 8-)
Let me quantify things. For me, one of the driving forces for technological political innovation needs to be traditional print media. As information grows out of control I see a real market for trusted non-partisan sources that can divine the semantic tea leaves. As they get more and more marginalized my hope is that they'll see this and run with it. The Washington Post is doing that.
> On 10/4/07, Josh Tauberer <taube...@govtrack.us> wrote:
> > (And, btw, I take friendly issue with your blog entry that the WaPo is
> > leading the way in 21st century democracy with their votes database.
> > ::grin::)
>
> OK... that's a fair cop 8-)
>
> Let me quantify things. For me, one of the driving forces for technological
> political innovation needs to be traditional print media. As information
> grows out of control I see a real market for trusted non-partisan sources
> that can divine the semantic tea leaves. As they get more and more
> marginalized my hope is that they'll see this and run with it. The
> Washington Post is doing that.
Well, we're *trying* to do that. But Josh got there first, and I hope he knows how many folks in the media appreciate his efforts.
As to the questions raised, I'd prefer an entire database dump, or at least sections of the database that can be regularly updated, sort of the way the Federal Election Commission does with its data. Plenty of other things can be extended from that, including RSS and other stuff, so if requiring the LoC to have it delays things, then I second Josh's recommendation. Simplicity works best.
I'm fairly new to this group, but I figure I should throw my 2 cents in. I'm the CTO of the Sunlight Foundation, with long experience in open source and data processing.
File System/Formats -
Like Derek, I'd generally prefer everything to be in an actual database, with timestamps on records and such. That's the ultimately flexible format that allows for us (the consumers of the data) to create feeds, to sort and group the data, and to run statistics with ease.
SPARQL looks pretty neat, if verbose and somewhat cumbersome, but I don't think it particularly buys us anything over a solid open source database infrastructure.
Notifications
A proper database structure would handle this easily.
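A minimal sketch of why timestamps on records make change notification easy, using an in-memory SQLite database as a stand-in. The `bills` table, its column names and the sample rows are all hypothetical; the only assumption that matters is that every record carries an `updated_at` value that is refreshed whenever the record changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bills (
        id TEXT PRIMARY KEY,
        title TEXT NOT NULL,
        updated_at TEXT NOT NULL   -- ISO-8601 UTC, so text order is time order
    )
""")
conn.executemany(
    "INSERT INTO bills VALUES (?, ?, ?)",
    [
        ("hr1234-110", "Example Act", "2007-10-01T09:00:00Z"),
        ("s56-110", "Another Act", "2007-10-03T14:30:00Z"),
        ("hr99-110", "Older Act", "2007-09-20T08:00:00Z"),
    ],
)


def changed_since(conn: sqlite3.Connection, since: str) -> list:
    """Return ids of bills updated after `since`, oldest change first."""
    rows = conn.execute(
        "SELECT id FROM bills WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    )
    return [row[0] for row in rows]


changed = changed_since(conn, "2007-09-30T00:00:00Z")
```

Feeds, alerts and nightly syncs can all be built on that one query, which is why the notification question mostly disappears once the timestamps exist.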
RSS Feeds
Let third parties (Josh & GovTrack, Sunlight, etc) provide the RSS feeds, alerts, twitter interfaces and all that other stuff. Keep it simple at the source.
Workaround
Sorry, I don't know the current situation, so I can't comment.
Contingency plans
- We all get together and someone agrees to be the one group that pulls from the existing data, and creates the database, and the others pull from that manufactured database? Sunlight, for example, certainly has the resources to do this, if necessary.
-- CTO @ SunlightFoundation.com - 678 467 3504 Agile Development Blog: IndefiniteArticles.com Stone Magic: Stonemagic.Picobusiness.com
as we think about these issues, there are two things that i think are useful to consider. these are things i have found in over 20 years of dealing with legacy data, legacy software, legacy formats and trying to use data today that was created years ago:
1. (and this has been mentioned here before) "simple" gets implemented and "complex" does not. that's an oversimplification, of course, but we've seen examples: html being so simple that it enabled the web, but with rdf being complex and much slower to create the semantic web; the government's "GILS" (Government Information Locator Service) being well-thought out, but mostly un-implemented; etc.
2. separation of data from applications is *always* better for preservation purposes. when agencies instantiate their information in databases it is just too easy for them to leave out data that doesn't fit the database design and too tempting to use software-specific functionality that gets lost in translation to any other system.
this leads me only to generic suggestions, not specific ones:
a. software-neutral and OS-neutral formats for distribution and preservation (i.e., xml)
b. "minimal-level" rules for mark-up and metadata to help ensure that some core information *always* gets produced and saved -- even if it is *possible* to produce much more complex and demanding and expensive information.
c. flexibility: standards should allow for complete, comprehensive markup without limitations (field size, character-encoding, etc.) and should allow for change over time as our needs change.
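one way to picture suggestion (b) is a tiny check that every distributed record carries an agreed core, while anything richer stays optional. this is only a sketch: the three required element names and the `<bill>` layout are invented for illustration, and a real "minimal-level" list would have to be agreed on by the people producing and using the data.

```python
import xml.etree.ElementTree as ET

# Hypothetical "minimal-level" core; not taken from any existing standard.
REQUIRED_FIELDS = ("title", "sponsor", "introduced")


def missing_core_fields(xml_text: str) -> list:
    """Return the required elements that are absent or empty in a record."""
    root = ET.fromstring(xml_text)
    missing = []
    for name in REQUIRED_FIELDS:
        text = root.findtext(name)
        if text is None or not text.strip():
            missing.append(name)
    return missing


# Extra elements such as <summary> are allowed; only the core is enforced.
COMPLETE = (
    "<bill><title>Example Act</title><sponsor>Rep. A</sponsor>"
    "<introduced>2007-10-01</introduced>"
    "<summary>Optional extra detail.</summary></bill>"
)
PARTIAL = "<bill><title>Example Act</title><sponsor> </sponsor></bill>"

complete_gaps = missing_core_fields(COMPLETE)
partial_gaps = missing_core_fields(PARTIAL)
```

because the check only looks for the core and ignores everything else, richer markup can be added later without breaking older records or older tools.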
James A. Jacobs Data Services Librarian Emeritus University of California San Diego
To what degree should legislative metadata standards coordination efforts (i.e., re-envisioning THOMAS with enhanced functionality and public database access) take into account existing archival metadata standards?
I've been looking through this LOC website for the Encoded Archival Description <http://www.loc.gov/ead/>, which looks like a set of DTDs and schemas for archiving. In what ways is standardizing legislative information formats different from standardizing archive descriptions? In what ways is THOMAS different from an archive?
Are there many similar specialized discussions with relevance to this issue? GODORT <http://www.ala.org/ala/godort/godort.htm> is the only other I'm aware of (oh, and GOVDOC-L <http://govdoc-l.org/>, which is really fascinating because the conversations have been archived on a Google group since 1991, making for some interesting discussions about issues like adjusting to CDs as storage devices, or imagining how government will be affected by the then-nascent Internet).
John
On 10/5/07, james a. jacobs <jamesajac...@mac.com> wrote:
> as we think about these issues, there are two things that i think are > useful to consider. these are things i have found in over 20 years of > dealing with legacy data, legacy software, legacy formats and trying > to use data today that was created years ago:
> 1. (and this has been mentioned here before) "simple" gets > implemented and "complex" does not. that's an oversimplification, of > course, but we've seen examples: html being so simple that it > enabled the web, but with rdf being complex and much slower to create > the semantic web; the government's "GILS" (Government Information > Locator Service) being well-thought out, but mostly un-implemented; etc.
> 2. separation of data from applications is *always* better for > preservation purposes. when agencies instantiate their information in > databases it is just too easy for them to leave out data that doesn't > fit the database design and too tempting to use software-specific > functionality that gets lost in translation to any other system.
> this leads me only to generic suggestions, not specific ones:
> a. software-neutral and OS-neutral formats for distribution and > preservation (i.e., xml)
> b. "minimal-level" rules for mark-up and metadata to help ensure that > some core information *always* gets produced and saved -- even if it > is *possible* to produce much more complex and demanding and > expensive information.
> c. flexibility: standards should allow for complete, comprehensive > markup without limitations (field size, character-encoding, etc.) and > should allow for change over time as our needs change.
> James A. Jacobs > Data Services Librarian Emeritus > University of California San Diego
-- John Wonderlich
Program Director The Sunlight Foundation (202) 742-1520 ext. 234
- What file formats/system would you recommend? Is a complete dump of the entire database necessary?
-A web service with XML.
- How could sites be made aware of changes to the system? Rather than accessing every bill record every night, is there a way that sites could only access records that had been updated (i.e. new cosponsors, bill action, etc).
-They don't need to be made aware of changes; they just need to develop a system for re-checking the web service and identifying new content. Queries don't need to be babysat--- if only some records were available at any time, there would inevitably be a time when you'd need the old ones that were no longer available.
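A rough sketch of what such a re-checking system could look like on the consumer side: keep a fingerprint of each record from the last run and compare it with what the web service returns now. The record ids and contents are made up, and the fetching step is left out entirely; the sketch assumes the records have already been downloaded as text.

```python
import hashlib


def fingerprint(record_text: str) -> str:
    """Stable digest of a record's content."""
    return hashlib.sha256(record_text.encode("utf-8")).hexdigest()


def find_changes(seen: dict, fetched: dict) -> list:
    """Compare freshly fetched records with fingerprints from the last run.

    Updates `seen` in place and returns the ids of new or changed records,
    sorted by id.
    """
    changed = []
    for record_id, text in fetched.items():
        digest = fingerprint(text)
        if seen.get(record_id) != digest:
            changed.append(record_id)
            seen[record_id] = digest
    return sorted(changed)


seen = {}
first = find_changes(seen, {
    "hr1234-110": "<bill>v1</bill>",
    "s56-110": "<bill>v1</bill>",
})
second = find_changes(seen, {
    "hr1234-110": "<bill>v1 plus cosponsor</bill>",
    "s56-110": "<bill>v1</bill>",
})
```

Note that this approach still has to download every record on every run in order to compare it, which is the server load the original question was hoping to avoid.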
- Is it important that RSS feeds be made available for search terms? For example, an RSS feed for all new bills that contain the word Iraq in the text.
Not really--I can understand why a small slice of the population would want it, but I don't think it's really needed by the broader population. If you have a web service, you have everything, and you can tease out terms, simplifying information for others.
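As an illustration of "teasing out terms" downstream, here is a sketch of how a third party holding the full data could build a keyword feed itself. The bill dictionaries, their ids and the feed link are placeholders invented for the example; only the RSS 2.0 element names (`channel`, `item`, `title`, `link`, `description`, `guid`) come from the RSS format itself.

```python
import xml.etree.ElementTree as ET


def term_feed(bills: list, term: str) -> str:
    """Build an RSS 2.0 feed of bills whose text contains `term`."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Bills mentioning {term}"
    # Placeholder link; a real feed would point at the publishing site.
    ET.SubElement(channel, "link").text = "https://example.org/feeds"
    ET.SubElement(channel, "description").text = f"New bills containing '{term}'"
    for bill in bills:
        # Case-insensitive substring match on the full text.
        if term.lower() in bill["text"].lower():
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = bill["title"]
            # The id is not a URL, so mark the guid as not a permalink.
            ET.SubElement(item, "guid", isPermaLink="false").text = bill["id"]
    return ET.tostring(rss, encoding="unicode")


# Made-up records standing in for data pulled from the web service.
BILLS = [
    {"id": "hr1234-110", "title": "Example Act", "text": "Funding for operations in Iraq."},
    {"id": "s56-110", "title": "Another Act", "text": "Highway maintenance grants."},
]

feed = term_feed(BILLS, "Iraq")
```

A plain substring match like this is crude, but it shows that search-term feeds do not have to be built at the source once the raw records are available.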
- What's the work around for this need today?
There are a couple of good web services created by a few states already. They made them in-house and are willing to share them freely with others. LoC doesn't need to pay out a bunch of money for this--- it's freely available to them now, and they just need to make a few changes to suit their needs.
Thanks for all of the excellent technical recommendations everyone.
I've been told that folks at LoC are following our conversations here and have been finding them quite useful. Perhaps the next step in developing a recommendation on the community's technical and functional requirements would be a conference call and then collaboration on a Google Doc?
I'm also looking at holding a discussion with Congressional staff about what changes they would like to see in LIS and Thomas. I'd invite LIS and Thomas staff to brief offices on their plans for the future and we could then discuss what features staff would like to see. I've already spoken to staffers who want to be able to display cosponsors and other bill data through some sort of official web service / api, and providing a forum for those requests could help make that a reality.
Sarah raised a great point, and I was also hoping we could point out to LoC some concrete examples of legislative databases that were implemented in a really useful way. Do any state or foreign governments have particularly good implementations of web services and/or ways of making their raw legislative database available?
UW ITS http://www.its.washington.edu/ does a great job of serving data for a variety of DOT systems in the Northwest. One of the cooler applications created from the data is the now-defunct Bus Monster: http://www.busmonster.com/
On Oct 11, 2:50 pm, "Rob Pierson" <piers...@gmail.com> wrote:
> Do any state or foreign governments have particularly
> good implementations of web services and/or ways of making their raw
> legislative database available?