Senate lobbyist database

Drew

Jul 5, 2008, 10:29:40 PM

to Watchdog Volunteers

I spent a few hours looking at the Senate LDA lobbyist database and
made some notes, which I've posted here:

http://watchdog.jottit.com/lobbying_database

There are a few missing sections. I'll fill those in soon, but I need
to do more analysis first.

I haven't written a parser yet, but this page should be useful for
anyone who does. (David, are you still working on a parser for this
database?) Please speak up if you have questions or suggestions
regarding solutions to the problems or open questions listed on the
wiki.

With respect to the inconsistent naming and subsidiaries problems, I
think it's probably a better tradeoff to patch them up during report
generation, rather than trying to normalize them during parsing or
database import. It'll make it easier to compare the Watchdog database
with the Senate XML documents, at the very least.

Aaron Swartz

Jul 6, 2008, 9:57:42 AM

to watchdog-...@googlegroups.com

> http://watchdog.jottit.com/lobbying_database

This is amazing work, Drew. This is just the kind of analysis we need.
I'm traveling at the moment, but I hope you, on your own or working with
someone, can turn this into a parser.

Drew

Jul 6, 2008, 10:28:54 PM

to watchdog-...@googlegroups.com

On Sun, Jul 6, 2008 at 6:57 AM, Aaron Swartz <m...@aaronsw.com> wrote:

> I'm traveling at the moment, but I hope you, on your own or working with
> someone, can turn this into a parser.

Thanks, I'm glad you think it's useful.

David's already working on a parser, so I'll wait a few days to hear from
him before I write any code. In any case, if he doesn't have time to
finish the parser or incorporate the things listed on the wiki, I can do
it.

John Osborne

Jul 7, 2008, 9:54:03 AM

to watchdog-...@googlegroups.com

I threw together a crude parser a week or so ago; it could probably
use some improvement. David or anyone else is free to use it.
It's at http://jro.freeshell.org/lobbyist.py

--
John Osborne
osbo...@ieee.org/osbo...@gmail.com/j...@freeshell.org

Drew

Jul 23, 2008, 6:11:11 PM

to Watchdog Volunteers

The lobbying regulations require lobbying organizations to file
termination reports under a variety of conditions: when a client
terminates their services, when an individual lobbyist leaves a parent
organization, when a client merges with another client, etc. Between
the initial registration filing and the termination report, there's
enough information in the database to determine who was lobbying for
whom and for how long. Conversely, absence of a termination report
indicates that the relationship still exists.

Is this kind of information interesting to the Watchdog Project? The
accounting is tricky in some cases, and it means adding one or more
new tables to the lobbying database, but the information is there if
it's useful.
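
The interval accounting described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the tuple layout, the two filing kinds, and the sample names are hypothetical, and real LDA filings include amendments and other report types that this ignores.

```python
from datetime import date

# Hypothetical, simplified filings: (registrant, client, kind, date).
# Real LDA records carry many more fields and filing types.
FILINGS = [
    ("Lobby Shop A", "Client X", "registration", date(2001, 3, 1)),
    ("Lobby Shop A", "Client X", "termination", date(2004, 6, 30)),
    ("Lobby Shop B", "Client X", "registration", date(2005, 1, 15)),
]


def relationship_intervals(filings):
    """Map (registrant, client) to a list of [start, end] spans.

    An end of None means no termination report was seen, which is
    treated here as "the relationship still exists".
    """
    intervals = {}
    for registrant, client, kind, when in sorted(filings, key=lambda f: f[3]):
        spans = intervals.setdefault((registrant, client), [])
        has_open_span = bool(spans) and spans[-1][1] is None
        if kind == "registration" and not has_open_span:
            spans.append([when, None])   # a new relationship begins
        elif kind == "termination" and has_open_span:
            spans[-1][1] = when          # close the open relationship
    return intervals


intervals = relationship_intervals(FILINGS)
```

Storing one row per span (registrant, client, start, end) would be one way to shape the new table mentioned above.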

Aaron Swartz

Jul 23, 2008, 6:15:14 PM

to watchdog-...@googlegroups.com

I think it'd be useful; especially for establishing revolving door
stuff (e.g. Joe works for a Congressman, leaves to become a lobbyist,
leaves to go work for a Congressman again).

Jeremy Dunck

Jul 23, 2008, 6:35:03 PM

to watchdog-...@googlegroups.com

On Wed, Jul 23, 2008 at 5:11 PM, Drew <drew...@gmail.com> wrote:
>
> The lobbying regulations require lobbying organizations to file
> termination reports under a variety of conditions: when a client
> terminates their services, when an individual lobbyist leaves a parent
> organization, when a client merges with another client, etc. Between
> the initial registration filing and the termination report, there's
> enough information in the database to determine who was lobbying for
> whom and for how long. Conversely, absence of a termination report
> indicates that the relationship still exists.

I've confirmed with David Brown that he's not working on it any more.

Drew, any interest in collaborating on this? I've been working on it
a bit myself. Maybe we could get on IRC to make sure we're not
stepping on each other...

I just edited this page to reflect the status on it:
http://watchdog.jottit.com/volunteer

Anyway, I have basic XML -> dict parsing done, but that's not
particularly useful.

I was reading up on normalized edit distance to try to deal with
different names for the same entity (General Electric vs GE vs General
Electric, Inc.).

I recently edited this page with a bit more info:
http://watchdog.jottit.com/lobbying_database
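
For reference, here is a minimal sketch of one simple form of normalized edit distance: Levenshtein distance divided by the length of the longer string. This is an assumption about what "normalized" means (other definitions exist), and the function names are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Scale the distance into [0, 1] by dividing by the longer length."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


close = normalized_distance("General Electric", "General Electric, Inc.")
far = normalized_distance("General Electric", "GE")
```

With this normalization, "General Electric" vs "General Electric, Inc." scores about 0.27, while "General Electric" vs "GE" scores 0.875, so abbreviations would probably still need an explicit alias list.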

Drew

Jul 23, 2008, 11:20:40 PM

to watchdog-...@googlegroups.com

On Wed, Jul 23, 2008 at 3:35 PM, Jeremy Dunck <jdu...@gmail.com> wrote:

> Drew, any interest in collaborating on this? I've been working on it
> a bit myself. Maybe we could get on IRC to make sure we're not
> stepping on each other...
>
> I just edited this page to reflect the status on it:
> http://watchdog.jottit.com/volunteer
>
> Anyway, I have basic XML -> dict parsing done, but that's not
> particularly useful.

Hi Jeremy,

All I've done so far is read a bunch of things that explain how the
lobbying regulations are supposed to work, poke around in the XML
documents to figure out how to interpret the data, and try to document
it all on the wiki. I've written some code to help me understand some
of the special cases in the database, e.g., to catalog the 80+ record
types that appear in the entire database from 1999-2008, but I haven't
started writing a parser, proper.
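
Cataloging record types of this kind could look something like the sketch below. The `PublicFilings` and `Filing` element names and the `Type` attribute are assumptions about the XML layout, and the sample document is invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; element and attribute names are assumptions.
SAMPLE = """<PublicFilings>
  <Filing ID="1" Year="2007" Type="REGISTRATION"/>
  <Filing ID="2" Year="2007" Type="MID-YEAR REPORT"/>
  <Filing ID="3" Year="2008" Type="REGISTRATION"/>
</PublicFilings>"""


def count_filing_types(xml_text):
    """Count how often each value of the (assumed) Type attribute appears."""
    root = ET.fromstring(xml_text)
    return Counter(f.get("Type", "(missing)") for f in root.iter("Filing"))


type_counts = count_filing_types(SAMPLE)
```

Summing the counters from every file would give a catalog covering the whole date range.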

Once someone understands how filing amendments work, writing the
parser isn't particularly difficult, I think. I don't expect it to be
more than about 100 lines of code. So in my opinion, it's probably
easier for just one person to write it than to split it up. You're
welcome to code up the parser, if you like; if not, I can do
it. Whoever else is interested could do a code review or something.

The importer could be a bit tricky, so there might be an opportunity
for one of us to work on that while the other does the parser. It
depends on how we store these filing records in the Watchdog database,
I think. I'll think about it and post more later.

In any case, I can hang out on irc to answer questions. Is there a
Watchdog channel somewhere?

> I was reading up on normalized edit distance to try to deal with
> different names for the same entity (General Electric vs GE vs General
> Electric, Inc.).

I think this should be done on the report generation end of things,
not during parsing or import. Report generation is going to need a lot
of special-casing to deal with subsidiaries, conglomerates, corporate
mergers and acquisitions, name changes (e.g. Apple Computer -> Apple,
Inc.) and the like anyway; and the fewer things we modify during
import, the easier it'll be to audit the Watchdog database using the
original Senate XML documents.
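
A rough sketch of what patching names at report time might look like: the stored rows keep the names exactly as filed, and a separate alias table is applied only when totals are computed. The alias table, the row layout, and the amounts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical alias table: name as filed -> canonical name.
ALIASES = {
    "GE": "General Electric",
    "General Electric, Inc.": "General Electric",
    "Apple Computer": "Apple, Inc.",
}

# Rows as imported, with names untouched so they still match the source XML.
ROWS = [
    ("GE", 100000),
    ("General Electric, Inc.", 50000),
    ("General Electric", 25000),
    ("Apple Computer", 40000),
]


def totals_by_canonical_name(rows, aliases):
    """Sum amounts per organization, resolving aliases only at report time."""
    totals = defaultdict(int)
    for name, amount in rows:
        totals[aliases.get(name, name)] += amount
    return dict(totals)


report = totals_by_canonical_name(ROWS, ALIASES)
```

Because the aliases live outside the imported data, a wrong mapping can be corrected without re-importing anything.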

> I recently edited this page with a bit more info:
> http://watchdog.jottit.com/lobbying_database

I made a bunch of updates last night, too. I found a great resource on
the House of Representatives website that answers many of the
questions I had about how to interpret the data in the LDA
database. I've got about 4 more things to add to the wiki, and then I
think we'll have enough information to write a sensible parser. I'll
try to make those final edits tonight.

Drew

Jul 23, 2008, 11:36:50 PM

to watchdog-...@googlegroups.com

Oh, and one more thing. I think we'll end up with at least 3 parsers
for the lobbying information: the one we've been talking about to
track lobbying expenditures, one for tracking who's hiring whom, and
one for the new LD203 form, a semi-annual filing that tracks lobbyists'
contributions to federal officials. (See
http://watchdog.jottit.com/lobbying_database for details.)

So there'll be plenty of parsers to go around :)

Aaron Swartz

Jul 24, 2008, 1:08:59 AM

to watchdog-...@googlegroups.com

> In any case, I can hang out on irc to answer questions. Is there a
> Watchdog channel somewhere?

I just registered #watchdog on Freenode in case anyone is interested.

>> I was reading up on normalized edit distance to try to deal with
>> different names for the same entity (General Electric vs GE vs General
>> Electric, Inc.).
>
> I think this should be done on the report generation end of things,
> not during parsing or import. Report generation is going to need a lot
> of special-casing to deal with subsidiaries, conglomerates, corporate
> mergers and acquisitions, name changes (e.g. Apple Computer -> Apple,
> Inc.) and the like anyway; and the fewer things we modify during
> import, the easier it'll be to audit the Watchdog database using the
> original Senate XML documents.

I'm not sure what this means. Audit how? It seems like doing the
merging during load is better so that display can be done with fewer
queries.

Drew

Jul 24, 2008, 1:52:21 AM

to watchdog-...@googlegroups.com

On Wed, Jul 23, 2008 at 10:08 PM, Aaron Swartz <m...@aaronsw.com> wrote:
>> I think this should be done on the report generation end of things,
>> not during parsing or import. Report generation is going to need a lot
>> of special-casing to deal with subsidiaries, conglomerates, corporate
>> mergers and acquisitions, name changes (e.g. Apple Computer -> Apple,
>> Inc.) and the like anyway; and the fewer things we modify during
>> import, the easier it'll be to audit the Watchdog database using the
>> original Senate XML documents.
>
> I'm not sure what this means. Audit how? It seems like doing the
> merging during load is better so that display can be done with fewer
> queries.

Sorry I was unclear. I mean auditing by comparing records in the
Watchdog database to records in the source (i.e., the original XML
documents published by the Senate), in case there's a question about
the accuracy of the Watchdog data. If you're not concerned about that,
then you're probably right that name normalization should be done
during import.
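
To make the audit idea concrete, here is a minimal sketch that compares stored rows against the source XML by filing ID. The table schema, the element and attribute names, and the sample data are assumptions made for illustration, not the actual Watchdog or Senate layout.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented source document; names of elements and attributes are assumptions.
SOURCE_XML = """<PublicFilings>
  <Filing ID="1" Amount="20000"><Client ClientName="GE"/></Filing>
  <Filing ID="2" Amount="5000"><Client ClientName="Apple Computer"/></Filing>
</PublicFilings>"""


def audit(conn, xml_text):
    """Return IDs of filings whose stored values differ from the source XML."""
    mismatches = []
    for filing in ET.fromstring(xml_text).iter("Filing"):
        client = filing.find("Client")
        expected = (filing.get("Amount"), client.get("ClientName"))
        row = conn.execute(
            "SELECT amount, client_name FROM filing WHERE id = ?",
            (filing.get("ID"),),
        ).fetchone()
        if row != expected:  # also catches filings missing from the database
            mismatches.append(filing.get("ID"))
    return mismatches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filing (id TEXT PRIMARY KEY, amount TEXT, client_name TEXT)")
# Filing 1 was "normalized" during import; filing 2 was stored as filed.
conn.executemany(
    "INSERT INTO filing VALUES (?, ?, ?)",
    [("1", "20000", "General Electric"), ("2", "5000", "Apple Computer")],
)
flagged = audit(conn, SOURCE_XML)
```

In this example the filing whose client name was rewritten during import is flagged even though nothing is actually wrong, which is the cost of normalizing early; keeping the raw name in its own column alongside a normalized one would be one way to get both.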

Drew

Jul 24, 2008, 6:15:29 AM

to watchdog-...@googlegroups.com

On Wed, Jul 23, 2008 at 8:20 PM, Drew <drew...@gmail.com> wrote:

> I made a bunch of updates last night, too. I found a great resource on
> the House of Representatives website that answers many of the
> questions I had about how to interpret the data in the LDA
> database. I've got about 4 more things to add to the wiki, and then I
> think we'll have enough information to write a sensible parser. I'll
> try to make those final edits tonight.

I wasn't able to finish all of the edits on the wiki tonight, but I did update
a few things. I'll keep working on it tomorrow.

John Osborne

Jul 24, 2008, 9:24:07 AM

to watchdog-...@googlegroups.com

I threw together a rough parser for those XML files a few weeks
back and sent an email to in...@watchdog.net about it.
If you want to look at the code I have, let me know...

--
John Osborne
osbo...@ieee.org/osbo...@gmail.com/j...@freeshell.org

Aaron Swartz

Jul 24, 2008, 1:07:27 PM

to watchdog-...@googlegroups.com

I believe the URL you sent was http://jro.freeshell.org/lobbyist.py

Aaron Swartz

Jul 31, 2008, 2:27:40 PM

to watchdog-...@googlegroups.com

How's this going, Drew?

Drew

Aug 1, 2008, 2:39:55 PM

to watchdog-...@googlegroups.com

On Thu, Jul 31, 2008 at 11:27 AM, Aaron Swartz <m...@aaronsw.com> wrote:
>
> How's this going, Drew?

Sorry for the recent lack of updates. I added new info to the wiki
page on Sunday, but I haven't done anything since. I'm still committed
to delivering a parser, though.

I've figured out almost everything we need to make sense of the
database, but there are some records that don't make sense. For
example, here's part of a particular registration record:

It says that Katie Neal is a lobbyist who previously worked for the
government in a covered position, who didn't previously work for the
government in a covered position, and whose covered position status is
unknown.

There are other records that have similar discrepancies, and not just
in terms of lobbyists' status. I'm not sure what to do about
them. I'll try contacting the SOPR with some examples to see if they
can explain them.

Drew

Aug 16, 2008, 6:27:54 AM

to watchdog-...@googlegroups.com

I've been traveling the last 2 weeks, so I haven't had a lot of time
to work on Watchdog projects, but I have started writing a parser for
the Senate LDA database. This particular parser is designed to help me
find problems in the database: it'll import the raw records into an
SQLite database so that I can analyze them using SQL queries (e.g.,
find all records that don't have a registrant, a situation that should
be impossible, but somehow does happen). Once it's done, though, it
should be straightforward to adapt the parser to the Watchdog
database.
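
As an illustration of the approach (not the code in the repo), the sketch below loads raw filings into an in-memory SQLite database and then asks for filings that have no registrant. The two-table schema and the XML element and attribute names are assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample; element and attribute names are assumptions.
SAMPLE = """<PublicFilings>
  <Filing ID="A1" Type="REGISTRATION">
    <Registrant RegistrantName="Lobby Shop A"/>
  </Filing>
  <Filing ID="B2" Type="REGISTRATION"/>
</PublicFilings>"""


def load_filings(conn, xml_text):
    """Import raw records without cleaning them up, so oddities stay visible."""
    conn.execute("CREATE TABLE filing (id TEXT PRIMARY KEY, type TEXT)")
    conn.execute("CREATE TABLE registrant (filing_id TEXT, name TEXT)")
    for filing in ET.fromstring(xml_text).iter("Filing"):
        conn.execute("INSERT INTO filing VALUES (?, ?)",
                     (filing.get("ID"), filing.get("Type")))
        for reg in filing.iter("Registrant"):
            conn.execute("INSERT INTO registrant VALUES (?, ?)",
                         (filing.get("ID"), reg.get("RegistrantName")))


def filings_without_registrant(conn):
    """Find filings that have no matching registrant row."""
    rows = conn.execute(
        "SELECT f.id FROM filing AS f "
        "LEFT JOIN registrant AS r ON r.filing_id = f.id "
        "WHERE r.filing_id IS NULL ORDER BY f.id"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
load_filings(conn, SAMPLE)
orphans = filings_without_registrant(conn)
```

The same pattern (load everything as-is, then query for things that should be impossible) extends to other consistency checks.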

It's a work in progress, but the GitHub repo is here:
http://tinyurl.com/5vpofz

Suggestions or help are welcome!

Aaron Swartz

Sep 19, 2008, 5:02:28 PM

to watchdog-...@googlegroups.com

How's this going, Drew? Looks like you're making steady progress, to
judge from the commit logs. Anything we can help with?

Drew

Sep 26, 2008, 7:53:58 AM

to watchdog-...@googlegroups.com

On Fri, Sep 19, 2008 at 2:02 PM, Aaron Swartz <m...@aaronsw.com> wrote:
>
> How's this going, Drew? Looks like you're making steady progress, to
> judge from the commit logs. Anything we can help with?

There are only 3 more elements left to parse -- foreign orgs,
affiliated orgs and lobbying issues -- and then I'm finished with this
parser. As is, it can load into an sqlite3 database the complete set
of Senate LD-1/LD-2 disclosure data from 1999 until present, save for
the 3 missing element types. It takes about 50 minutes to parse and
import the entire history on a 2.4GHz MacBook Pro. (That execution
time will get slightly worse after I add the remaining 3 element
types.)

I haven't spent any time on the project in the last week, and won't be
able to resume the work until Monday, but I'll try to wrap it all up
by next weekend. The remaining work is easy.

Because this parser/importer is independent of any particular
"transparency" project and is intended to be used as a sort of
reference code base, it'll take some effort to make it ready for the
Watchdog Project, but the work should be relatively
straightforward. The only non-trivial bits that come to mind are
identifying all the common names for organizations like General
Electric, which will be tedious, and handling amendment records (see
the wiki for my thoughts on how those should be processed).

All of the code I've written to date for the reference parser is
complete, stable and tested, so if someone's looking for a task, he
could start integrating the reference code into Watchdog at any
time. That would at least get Watchdog started with the basic lobbying
information: what org hired which lobbying org(s), when and for how
much.

Nearly all of the records, so far, look reasonable, except for the
individual lobbyist data. It's a mess, and I don't trust it in many
cases. Another potential project for someone to work on is to parse
individual lobbyist data out of the House XML documents and merge it
with the Senate records. The House records from dates prior to around
2007 lack much of the interesting detail contained in their
corresponding Senate records, but the House records do appear to have
sane individual lobbyist information, at least.

Speaking of which, recent House records appear to be *more* detailed
than the Senate records, so it might also be a worthwhile project to
write a complete parser for the House database and use it for all
future records.

A couple of fields here and there are also suspect, but they're
generally redundant and/or not particularly useful or
interesting. These cases are documented in the unit tests, and I'll
write them up in the wiki, too, once I'm finished with the code.
