This is amazing work, Drew. This is just the kind of analysis we need.
I'm traveling at the moment, but I hope you, alone or working with
someone, can turn this into a parser.
> I'm traveling at the moment, but I hope you, alone or working with
> someone, can turn this into a parser.
Thanks, I'm glad you think it's useful.
David's already working on a parser, so I'll wait a few days to hear from
him before I write any code. In any case, if he doesn't have time to
finish the parser or incorporate the things listed on the wiki, I can do
it.
--
John Osborne
osbo...@ieee.org/osbo...@gmail.com/j...@freeshell.org
Is this kind of information interesting to the Watchdog Project? The
accounting is tricky in some cases, and it means adding one or more
new tables to the lobbying database, but the information is there, if it's
useful.
I've confirmed with David Brown that he's not working on it any more.
Drew, any interest in collaborating on this? I've been working on it
a bit myself. Maybe we could get on IRC to make sure we're not
stepping on each other...
I just edited this page to reflect the status on it:
http://watchdog.jottit.com/volunteer
Anyway, I have basic XML -> dict parsing done, but that's not
particularly useful.
I was reading up on normalized edit distance to try to deal with
different names for the same entity (General Electric vs GE vs General
Electric, Inc.).
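For anyone following along, here is a minimal sketch of the idea. It
uses plain Levenshtein distance divided by the length of the longer
string, which is only one of several ways to normalize edit distance
and is an assumption on my part, not necessarily the variant being
read about. The function names are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Scale the distance to [0, 1] by the longer string's length."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


suffix_case = normalized_distance("General Electric", "General Electric, Inc.")
abbrev_case = normalized_distance("GE", "General Electric")
```

With this normalization the "Inc." variant scores fairly close to the
full name, but "GE" scores as very distant, so abbreviations would
probably still need some other treatment.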
I recently edited this page with a bit more info:
http://watchdog.jottit.com/lobbying_database
> Drew, any interest in collaborating on this? I've been working on it
> a bit myself. Maybe we could get on IRC to make sure we're not
> stepping on each other...
>
> I just edited this page to reflect the status on it:
> http://watchdog.jottit.com/volunteer
>
> Anyway, I have basic XML -> dict parsing done, but that's not
> particularly useful.
Hi Jeremy,
All I've done so far is read a bunch of things that explain how the
lobbying regulations are supposed to work, poke around in the XML
documents to figure out how to interpret the data, and try to document
it all on the wiki. I've written some code to help me understand some
of the special cases in the database, e.g., to catalog the 80+ record
types that appear in the entire database from 1999-2008, but I haven't
started writing a parser, proper.
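A rough sketch of that kind of cataloging script, for illustration
only. It assumes each record is a Filing element with a Type
attribute; the element name, attribute name and sample values are
assumptions here, not taken from the wiki, so adjust them to match the
real documents.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_record_types(xml_files, tag="Filing", attr="Type"):
    """Tally the values of `attr` on every `tag` element in the given files.

    `xml_files` may hold file names or open binary file objects. The tag
    and attribute names are assumptions about the Senate XML layout.
    """
    counts = Counter()
    for xml_file in xml_files:
        # iterparse streams the document instead of loading it all at once.
        for _event, elem in ET.iterparse(xml_file, events=("end",)):
            if elem.tag == tag:
                counts[elem.get(attr, "<missing>")] += 1
                elem.clear()  # drop children we no longer need
    return counts


# Made-up sample document standing in for one of the real XML files.
sample = io.BytesIO(
    b'<PublicFilings>'
    b'<Filing Type="REGISTRATION"/>'
    b'<Filing Type="MID-YEAR REPORT"/>'
    b'<Filing Type="REGISTRATION"/>'
    b'</PublicFilings>'
)
type_counts = count_record_types([sample])
```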
Once someone understands how filing amendments work, writing the
parser isn't particularly difficult, I think. I don't expect it to be
more than about 100 lines of code. So in my opinion, it's probably
easier for just one person to write it than to split it up. You're
welcome to code up the parser, if you like; if not, I can do
it. Whoever else is interested could do a code review or something.
The importer could be a bit tricky, so there might be an opportunity
for one of us to work on that while the other does the parser. It
depends on how we store these filing records in the Watchdog database,
I think. I'll think about it and post more later.
In any case, I can hang out on irc to answer questions. Is there a
Watchdog channel somewhere?
> I was reading up on normalized edit distance to try to deal with
> different names for the same entity (General Electric vs GE vs General
> Electric, Inc.).
I think this should be done on the report generation end of things,
not during parsing or import. Report generation is going to need a lot
of special-casing to deal with subsidiaries, conglomerates, corporate
mergers and acquisitions, name changes (e.g. Apple Computer -> Apple,
Inc.) and the like anyway; and the fewer things we modify during
import, the easier it'll be to audit the Watchdog database using the
original Senate XML documents.
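To illustrate what I mean, here is a small sketch in which the stored
rows keep the names exactly as filed and the aliases are applied only
when a report is built. The alias table, function names and amounts
are all made up for the example.

```python
from collections import defaultdict

# Hypothetical, hand-maintained alias table used only at report time.
ALIASES = {
    "ge": "General Electric",
    "general electric": "General Electric",
    "general electric, inc.": "General Electric",
}


def canonical_name(raw_name):
    """Look up a report-time name; unknown names pass through unchanged."""
    key = " ".join(raw_name.lower().split())  # fold case and whitespace
    return ALIASES.get(key, raw_name)


def totals_by_client(rows):
    """Sum amounts per canonical client name without touching the rows."""
    totals = defaultdict(int)
    for raw_name, amount in rows:
        totals[canonical_name(raw_name)] += amount
    return dict(totals)


# Rows as they might come out of the database: names exactly as filed.
rows = [
    ("General Electric", 100),
    ("GE", 50),
    ("General Electric, Inc.", 25),
    ("Apple Computer", 10),
]
report = totals_by_client(rows)
```

The stored rows still match the source documents word for word, and
fixing a bad alias only means regenerating the report.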
> I recently edited this page with a bit more info:
> http://watchdog.jottit.com/lobbying_database
I made a bunch of updates last night, too. I found a great resource on
the House of Representatives website that answers many of the
questions I had about how to interpret the data in the LDA
database. I've got about 4 more things to add to the wiki, and then I
think we'll have enough information to write a sensible parser. I'll
try to make those final edits tonight.
So there'll be plenty of parsers to go around :)
I just registered #watchdog on Freenode in case anyone is interested.
>> I was reading up on normalized edit distance to try to deal with
>> different names for the same entity (General Electric vs GE vs General
>> Electric, Inc.).
>
> I think this should be done on the report generation end of things,
> not during parsing or import. Report generation is going to need a lot
> of special-casing to deal with subsidiaries, conglomerates, corporate
> mergers and acquisitions, name changes (e.g. Apple Computer -> Apple,
> Inc.) and the like anyway; and the fewer things we modify during
> import, the easier it'll be to audit the Watchdog database using the
> original Senate XML documents.
I'm not sure what this means. Audit how? It seems like doing the
merging during load is better so that display can be done with fewer
queries.
Sorry I was unclear. I mean auditing by comparing records in the
Watchdog database to records in the source (i.e., the original XML
documents published by the Senate), in case there's a question about
the accuracy of the Watchdog data. If you're not concerned about that,
then you're probably right that name normalization should be done
during import.
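Here is a sketch of the kind of audit I have in mind. It assumes a
hypothetical filing table that stores the ID, Year, Type and Amount
attributes as text exactly as they appear in the XML; the table,
column and attribute names are placeholders, not a schema anyone has
agreed on.

```python
import sqlite3
import xml.etree.ElementTree as ET


def audit_filings(conn, source_xml):
    """Return the IDs of filings whose stored row differs from the source XML.

    A filing that is missing from the database also counts as a mismatch.
    """
    mismatched = []
    for filing in ET.fromstring(source_xml).iter("Filing"):
        expected = (filing.get("Year"), filing.get("Type"), filing.get("Amount"))
        row = conn.execute(
            "SELECT year, type, amount FROM filing WHERE id = ?",
            (filing.get("ID"),),
        ).fetchone()
        if row != expected:
            mismatched.append(filing.get("ID"))
    return mismatched


# Made-up data: the stored amount for B2 disagrees with the source document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filing (id TEXT PRIMARY KEY, year TEXT, type TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO filing VALUES (?, ?, ?, ?)",
    [("A1", "2007", "REGISTRATION", "20000"),
     ("B2", "2007", "MID-YEAR REPORT", "40000")],
)
source = """<PublicFilings>
  <Filing ID="A1" Year="2007" Type="REGISTRATION" Amount="20000"/>
  <Filing ID="B2" Year="2007" Type="MID-YEAR REPORT" Amount="45000"/>
</PublicFilings>"""
bad_ids = audit_filings(conn, source)
```

A check like this only stays simple if the import leaves the filed
values alone, which is the trade-off I was getting at.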
> I made a bunch of updates last night, too. I found a great resource on
> the House of Representatives website that answers many of the
> questions I had about how to interpret the data in the LDA
> database. I've got about 4 more things to add to the wiki, and then I
> think we'll have enough information to write a sensible parser. I'll
> try to make those final edits tonight.
I wasn't able to finish all of the edits on the wiki tonight, but I did update
a few things. I'll keep working on it tomorrow.
--
John Osborne
osbo...@ieee.org/osbo...@gmail.com/j...@freeshell.org
Sorry for the recent lack of updates. I added new info to the wiki
page on Sunday, but I haven't done anything since. I'm still committed
to delivering a parser, though.
I've figured out almost everything we need to make sense of the
database, but there are some records that don't make sense. For
example, here's part of a particular registration record:
<Lobbyists>
  <Lobbyist LobbyistName="NEAL, KATIE" LobbyistStatus="0"
            LobbyisteIndicator="2" xmlns="" />
  <Lobbyist LobbyistName="NEAL, KATIE" LobbyistStatus="0"
            LobbyisteIndicator="0" OfficialPosition="N/A" xmlns="" />
  <Lobbyist LobbyistName="NEAL, KATIE" LobbyistStatus="0"
            LobbyisteIndicator="1" OfficialPosition="COMM DIR/REP DINGELL"
            xmlns="" />
</Lobbyists>
It says that Katie Neal is a lobbyist who previously worked for the
government in a covered position, who didn't previously work for the
government in a covered position, and whose covered position status is
unknown.
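Records like this can be flagged mechanically. The following is a
minimal sketch, not part of any existing parser: it groups the
Lobbyist elements in one Lobbyists block by name and reports any name
that appears with more than one LobbyisteIndicator value. The function
name is made up, and the sample is the fragment above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<Lobbyists>
  <Lobbyist LobbyistName="NEAL, KATIE" LobbyistStatus="0"
            LobbyisteIndicator="2" xmlns="" />
  <Lobbyist LobbyistName="NEAL, KATIE" LobbyistStatus="0"
            LobbyisteIndicator="0" OfficialPosition="N/A" xmlns="" />
  <Lobbyist LobbyistName="NEAL, KATIE" LobbyistStatus="0"
            LobbyisteIndicator="1" OfficialPosition="COMM DIR/REP DINGELL"
            xmlns="" />
</Lobbyists>"""


def conflicting_lobbyists(lobbyists_xml):
    """Map each lobbyist name to its indicator values when they disagree."""
    seen = defaultdict(set)
    for node in ET.fromstring(lobbyists_xml).iter("Lobbyist"):
        seen[node.get("LobbyistName")].add(node.get("LobbyisteIndicator"))
    # Keep only names listed with more than one distinct indicator value.
    return {name: sorted(values) for name, values in seen.items() if len(values) > 1}


conflicts = conflicting_lobbyists(SAMPLE)
```

Running something like this over every registration would give a
concrete list of examples to send along with the question.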
There are other records that have similar discrepancies, and not just
in terms of lobbyists' status. I'm not sure what to do about
them. I'll try contacting the SOPR with some examples to see if they
can explain them.
It's a work in progress, but the GitHub repo is here:
http://tinyurl.com/5vpofz
Suggestions or help are welcome!
There are only 3 more elements left to parse -- foreign orgs,
affiliated orgs and lobbying issues -- and then I'm finished with this
parser. As is, it can load into an sqlite3 database the complete set
of Senate LD-1/LD-2 disclosure data from 1999 to the present, save for
the 3 missing element types. It takes about 50 minutes to parse and
import the entire history on a 2.4GHz MacBook Pro. (That execution
time will get slightly worse after I add the remaining 3 element
types.)
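For readers who haven't looked at the repo, the general shape of a
load step like this can be sketched as follows. This is not the code
in the repository: the table layout, the Filing attribute names and
the function names are assumptions made up for the example, and it
handles only one flat record type.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical one-table schema; a real one would need more tables.
SCHEMA = """CREATE TABLE IF NOT EXISTS filing (
    id TEXT PRIMARY KEY, year INTEGER, type TEXT, amount INTEGER)"""


def filing_rows(xml_text):
    """Yield one (id, year, type, amount) tuple per Filing element."""
    for filing in ET.fromstring(xml_text).iter("Filing"):
        amount = filing.get("Amount")
        yield (
            filing.get("ID"),
            int(filing.get("Year")),
            filing.get("Type"),
            int(amount) if amount else None,  # empty or absent amount -> NULL
        )


def import_filings(conn, xml_text):
    """Insert every filing in one transaction and return the table's row count."""
    conn.execute(SCHEMA)
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT INTO filing VALUES (?, ?, ?, ?)",
                         filing_rows(xml_text))
    return conn.execute("SELECT COUNT(*) FROM filing").fetchone()[0]


# Made-up sample document.
sample = """<PublicFilings>
  <Filing ID="A1" Year="2007" Type="REGISTRATION" Amount=""/>
  <Filing ID="B2" Year="2008" Type="MID-YEAR REPORT" Amount="40000"/>
</PublicFilings>"""
conn = sqlite3.connect(":memory:")
row_count = import_filings(conn, sample)
```

Doing each document's inserts in a single transaction generally helps
sqlite3 import times, though I can't say how much it would matter for
the full history.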
I haven't spent any time on the project in the last week, and won't be
able to resume the work until Monday, but I'll try to wrap it all up
by next weekend. The remaining work is easy.
Because this parser/importer is independent of any particular
"transparency" project and is intended to be used as a sort of
reference code base, it'll take some effort to make it ready for the
Watchdog Project, but the work should be relatively
straightforward. The only non-trivial bits that come to mind are
identifying all the common names for organizations like General
Electric, which will be tedious, and handling amendment records (see
the wiki for my thoughts on how those should be processed).
All of the code I've written to date for the reference parser is
complete, stable and tested, so if someone's looking for a task, they
could start integrating the reference code into Watchdog at any
time. That would at least get Watchdog started with the basic lobbying
information: what org hired which lobbying org(s), when and for how
much.
Nearly all of the records, so far, look reasonable, except for the
individual lobbyist data. It's a mess, and I don't trust it in many
cases. Another potential project for someone to work on is to parse
individual lobbyist data out of the House XML documents and merge it
with the Senate records. The House records from dates prior to around
2007 lack much of the interesting detail contained in their
corresponding Senate records, but the House records do appear to have
sane individual lobbyist information, at least.
Speaking of which, recent House records appear to be *more* detailed
than the Senate records, so it might also be a worthwhile project to
write a complete parser for the House database and use it for all
future records.
A couple of fields here and there are also suspect, but they're
generally redundant and/or not particularly useful or
interesting. These cases are documented in the unit tests, and I'll
write them up in the wiki, too, once I'm finished with the code.