proposed contribution schema v1.1

Jeremy Carbaugh

May 27, 2009, 9:25:54 AM
to Data Commons
Here is the proposed common contribution schema. This schema will be
used to store both state and federal contributions. For the federal
contributions, indiv-cand, pac-pac, and pac-cand contributions will
all be stored in the common schema.

Messy formatting but you'll get the gist of it.

cycle                   year of the campaign cycle
transaction_id          reference to the record id from the CRP or NIMSP data
transaction_type        example: 22y (refund), 24z (in-kind contribution), etc.
is_amendment            True if transaction was from an amended report
jurisdiction            state or federal
amount                  dollar amount of contribution
datestamp               date of the contribution
contributor             reference to the contributor (person or PAC)
contributor_occupation  occupation of contributor, both employer/occupation for non-electronic records
contributor_employer    employer of contributor if electronic record
organization            organization to which the contributor is related, typically employer
parent_organization     parent of the organization if it exists
industry                industry code of organization
sector                  sector code of organization
category                specific category code of the organization
gender                  gender of contributor
address                 contributor address
city                    city of contributor
state                   state of contributor
zipcode                 zipcode of contributor
recipient               recipient of contribution (PAC or candidate)
recipient_party         political party to which recipient is related
recipient_is_pac        True if the recipient is a PAC
seat                    the seat being sought (US President, US Senate, State Lower Chamber, etc.)
seat_status             relation to seat being sought (challenger, incumbent, etc.)
seat_result             whether the seat was won or lost
district                the congressional or state-level district of the seat
election_type           primary or general
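
For anyone who finds a record type easier to read than a column list, here is a rough Python sketch of the same fields. The field names are taken from the schema in this post; the Python types, the choice of which fields are optional, and the example values are all assumptions for illustration, not part of the proposal. Optional fields are moved to the end only because dataclasses require defaults to come last.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class Contribution:
    """One row of the proposed schema. Field names come from the proposal;
    the types and the required/optional split are assumptions."""
    cycle: int
    transaction_id: str
    transaction_type: str          # e.g. "22y" (refund), "24z" (in-kind)
    is_amendment: bool
    jurisdiction: str              # "state" or "federal"
    amount: Decimal
    datestamp: date
    contributor: str               # reference to a generic entity record
    recipient: str                 # reference to a generic entity record
    recipient_is_pac: bool
    contributor_occupation: Optional[str] = None
    contributor_employer: Optional[str] = None
    organization: Optional[str] = None
    parent_organization: Optional[str] = None
    industry: Optional[str] = None
    sector: Optional[str] = None
    category: Optional[str] = None
    gender: Optional[str] = None
    address: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    zipcode: Optional[str] = None
    recipient_party: Optional[str] = None
    seat: Optional[str] = None
    seat_status: Optional[str] = None
    seat_result: Optional[str] = None
    district: Optional[str] = None
    election_type: Optional[str] = None


# Made-up example record; every value here is illustrative only.
example = Contribution(
    cycle=2008,
    transaction_id="example-0001",
    transaction_type="24z",
    is_amendment=False,
    jurisdiction="federal",
    amount=Decimal("250.00"),
    datestamp=date(2008, 3, 14),
    contributor="entity-1",
    recipient="entity-2",
    recipient_is_pac=False,
    election_type="general",
)
```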

Clay Johnson

May 27, 2009, 9:34:50 AM
to datac...@googlegroups.com
Let's have some meta-data about the system in here as well:

date_imported (date that this went into the data commons)
imported_by (person or system that imported the data)
version(?) it may be that we want to edit the data at some point. Adding a version column could make it so we could edit data, store it as a new row and just increment the version number by 1.
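
A minimal sketch of how the version idea could work, assuming rows are plain dictionaries keyed by transaction_id. The helper names `add_version` and `current_rows` are hypothetical, and the sample values are made up; the point is only that an edit appends a new row with the version bumped by 1 and leaves the old row untouched.

```python
from datetime import date


def add_version(rows, transaction_id, changes, imported_by, imported_on):
    """Append an edited copy of the newest row rather than updating in place."""
    matching = [r for r in rows if r["transaction_id"] == transaction_id]
    latest = max(matching, key=lambda r: r["version"])
    new_row = {
        **latest,
        **changes,
        "version": latest["version"] + 1,
        "date_imported": imported_on,
        "imported_by": imported_by,
    }
    rows.append(new_row)
    return new_row


def current_rows(rows):
    """Return only the highest-version row for each transaction_id."""
    newest = {}
    for row in rows:
        key = row["transaction_id"]
        if key not in newest or row["version"] > newest[key]["version"]:
            newest[key] = row
    return list(newest.values())


# Illustrative data only.
rows = [
    {"transaction_id": "t1", "amount": 100, "version": 1,
     "date_imported": date(2009, 5, 27), "imported_by": "loader"},
]
add_version(rows, "t1", {"amount": 150}, "editor", date(2009, 6, 1))
```

Queries that only care about the current state would go through something like `current_rows`, while the full history stays available in the table.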

Is it your intention to have different tables for PACs and Candidates to reference from here via IDs?

--C
--
Clay Johnson
Director of Sunlight Labs
cjoh...@sunlightfoundation.com
AIM: knowpost
Google: clayjohnson
Calendar: http://bit.ly/xWvu

Jeremy Carbaugh

May 27, 2009, 9:47:27 AM
to datac...@googlegroups.com
So even though we only have the Contribution schema right now, all of the schemas will contain a common set of metadata attributes. I'll start another thread to discuss those.

PACs, Candidates, and individuals will all be entities. The references in the contribution schema are to generic entity records, not to tables of specific candidate or PAC types. Most of the important fields in these tables have been denormalized and included in the Contribution schema.

There are probably a few other desired fields that are not currently included. We are still working on ideas for an entity attribute system that would allow for querying.

Jeremy

Clay Johnson

May 27, 2009, 4:21:51 PM
to datac...@googlegroups.com

> PACs, Candidates, and individuals will all be entities. The references in the contribution schema are to generic entity records, not to tables of specific candidate or PAC types. Most of the important fields in these tables have been denormalized and included in the Contribution schema.

Can you explain this a bit more and the thought process here?

--C

Jeremy Carbaugh

May 27, 2009, 5:31:04 PM
to datac...@googlegroups.com
At its core, a contribution is something giving money to something. If you want to know "Who gave money to Mitch McConnell?", you sometimes want to know who contributed regardless of whether they are a person or a PAC. You could either store the data in one common format and link to generic entities, or separate the records into two tables with proper foreign key relationships to PAC and individual tables.

From a data retrieval perspective, (I presume) it is less expensive to filter out unneeded records when asking for only PAC contributions than it is to combine records from two distinct searches if the PAC and individual contributions are stored separately.
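
To make the comparison concrete, here is a small sketch of both layouts using in-memory lists. It assumes a hypothetical `contributor_type` field for telling PACs from individuals (the proposed schema does not define one), and the function names and rows are made up. It illustrates the shape of the two queries, not their actual cost on a real database.

```python
def gifts_to(contributions, recipient, contributor_type=None):
    """Single common table: one scan, with an optional extra filter."""
    return [
        c for c in contributions
        if c["recipient"] == recipient
        and (contributor_type is None or c["contributor_type"] == contributor_type)
    ]


def gifts_to_split(individual_rows, pac_rows, recipient):
    """Two separate tables: query each one, then combine and re-sort."""
    combined = [c for c in individual_rows if c["recipient"] == recipient]
    combined += [c for c in pac_rows if c["recipient"] == recipient]
    return sorted(combined, key=lambda c: c["datestamp"])


# Illustrative rows; "contributor_type" is a hypothetical field.
common = [
    {"contributor": "e1", "contributor_type": "individual",
     "recipient": "e9", "datestamp": "2008-01-05"},
    {"contributor": "e2", "contributor_type": "pac",
     "recipient": "e9", "datestamp": "2008-02-10"},
    {"contributor": "e3", "contributor_type": "individual",
     "recipient": "e8", "datestamp": "2008-03-01"},
]
everyone = gifts_to(common, "e9")
pacs_only = gifts_to(common, "e9", contributor_type="pac")
```

With the common layout, narrowing to PACs is one more condition on the same scan. With the split layout, the "everyone" question needs two queries plus a merge step.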

There is also an advantage to denormalized data when dealing with large data sets. Joins are big performance killers so the fewer you have to do, the better. Also Hadoop is optimized for denormalized flat files so storing the data in the same way will make processing much easier.
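
As a rough illustration of why a flat layout is easy to process, the sketch below totals contributions per sector in a single pass over CSV rows, with no lookups against other tables. The column subset, the sector codes, and the function name `total_by_sector` are placeholders; this is plain Python standing in for the kind of per-record job one might run on Hadoop, not Hadoop code.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def total_by_sector(flat_file):
    """Sum contribution amounts per sector in a single pass, with no joins."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(flat_file):
        totals[row["sector"]] += Decimal(row["amount"])
    return dict(totals)


# Illustrative data; the sector codes are placeholders.
sample = io.StringIO(
    "transaction_id,amount,sector,recipient\n"
    "t1,100.00,A,e9\n"
    "t2,250.50,B,e9\n"
    "t3,49.50,A,e8\n"
)
totals = total_by_sector(sample)
```

Because the sector is already on every contribution row, each record can be handled on its own, which is what makes this kind of work easy to split across machines.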

I believe that there is another advantage in flexibility. If a candidate donates money to another candidate directly, not through a PAC, then in a normalized database that person would have two records: candidate and individual. Using a generic entity, not only can you use the same entity reference in both situations, but merging these multi-type entities from different datasets will just be a matter of merging one entity with the other.
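
A minimal sketch of that merge, assuming entities live in a dictionary keyed by id and each carries a set of types. The structure, the `merge_entities` name, and the sample records are all hypothetical; the entity design itself is still being worked out.

```python
def merge_entities(entities, contributions, keep_id, drop_id):
    """Fold drop_id into keep_id and repoint every contribution reference."""
    kept = entities[keep_id]
    dropped = entities.pop(drop_id)
    kept["types"] |= dropped["types"]
    for c in contributions:
        for field in ("contributor", "recipient"):
            if c[field] == drop_id:
                c[field] = keep_id
    return kept


# Illustrative data: the same person appears once as a candidate and
# once as an individual donor in two different source datasets.
entities = {
    "e1": {"name": "Candidate A", "types": {"candidate"}},
    "e2": {"name": "Candidate A", "types": {"individual"}},
    "e3": {"name": "Candidate B", "types": {"candidate"}},
}
contributions = [
    {"contributor": "e2", "recipient": "e3", "amount": 500},
    {"contributor": "e4", "recipient": "e1", "amount": 100},
]
merge_entities(entities, contributions, keep_id="e1", drop_id="e2")
```

After the merge there is one entity with both types, and contributions in either direction point at the same id.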