On Sat, 9 May 2015 20:03:40 -0700
Tom Dooner <tomd...@gmail.com> wrote:
> Hi all,
>
> It was good seeing many of you this morning at the CFA HQ. We made
> some good progress getting on the same page about goals and desired
> architecture. For anyone that couldn't make it, you can relive it
> vicariously through the minutes
> <https://docs.google.com/document/d/16iJ30sORg2hBW9mKNPXsMs8uS4XvmB9OzHD-qGx8_Rg/edit>
I was sorry to have missed the meeting.
> We believe we have enough of an idea of where we want to go that
>
> *it's time to start building something together.*
Definitely!
> *Next Saturday (May 16th),* will be the first (of hopefully many) hack
> sessions where we start building and designing where we're going from
> here. If you can't make it, fear not, there will be more
> opportunities. The agenda for this next week is simple: start working
> on the backend that will ETL the data from NetFile.
I look forward to being there.
> Thanks to Peter and Code for SF, we will be having this hack day at
> the CFA HQ from *10am-2pm*.
>
> To review the architecture we're thinking about building (speak now or
> forever hold your peace):
>
> - We are building a shared backend across multiple jurisdictions
Are we going to start with the CCDC project as a base? If we want to
generalize that project by adding modifiers for each government entity,
the CCDC people may be OK with that. Their code would then be just a
single "CA" case, next to the "SJC", "SFC", "OAK" and "SCT" cases.
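To make the idea concrete, here is a rough sketch of how per-jurisdiction
cases might be registered. It is not the CCDC code; the class, the registry
and the source labels are hypothetical, and mapping the non-"CA" codes to
NetFile is only an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Jurisdiction:
    code: str    # short case label, e.g. "CA" or "OAK"
    source: str  # hypothetical label for the upstream data source


# Hypothetical registry: "CA" is the existing Cal-Access case; the other
# codes are assumed here to be fed from NetFile.
JURISDICTIONS = {
    j.code: j
    for j in (
        Jurisdiction("CA", "cal-access"),
        Jurisdiction("SJC", "netfile"),
        Jurisdiction("SFC", "netfile"),
        Jurisdiction("OAK", "netfile"),
        Jurisdiction("SCT", "netfile"),
    )
}


def source_for(code: str) -> str:
    """Return the data source label for a jurisdiction code."""
    return JURISDICTIONS[code].source
```

Shared code would then branch only on the jurisdiction record, so adding a
new entity means adding one entry rather than forking the project.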
> - First we will focus on pulling all data from the NetFile API
> <https://netfile.com/connect2>
> - Later we will focus on Cal-Access
> <http://www.sos.ca.gov/campaign-lobbying/cal-access-resources/raw-data-campaign-finance-and-lobbying-activity/>
> - It is important to maintain a copy of the original source data
> - This will help us audit any transformations we apply to the data
> (deduplication, associations, name cleanup, etc.)
If you want to scope out the de-duplication task, see this table:
http://opencalaccess.org/calaccess_raw_20150509_000246_PDT/calaccess_campaign_browser_identity.sql.gz
FYI, in case anyone wants to peruse it, the table at
http://opencalaccess.org/calaccess_raw_20150509_000246_PDT/calaccess_campaign_browser_identity_attribute.sql.gz
contains links from those names to the ID numbers assigned in the DIME
project (http://data.stanford.edu/dime).
I found matches in the Cal-Access data for 87,778 names in the DIME
data.
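For anyone scoping the de-duplication task, a first pass could be as simple
as grouping names that collapse to the same normalized key. This is only a
naive sketch with made-up sample names; it is not how the dedupe library or
the DIME matching works, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def candidate_groups(names):
    """Group raw names by normalized key; keep only keys with 2+ spellings."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return {key: raw for key, raw in groups.items() if len(raw) > 1}


# Made-up sample input, not taken from the Cal-Access tables.
sample = ["Smith, John", "SMITH JOHN", "Jane Doe"]
print(candidate_groups(sample))
```

The raw spellings are kept alongside the key, so the original values remain
available for auditing whatever merge is applied later.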
> - This will also help us perform subsequent updates
> incrementally, which will be a helpful performance benefit
> - Python is our *lingua franca* due to the team's general
> familiarity with it as well as the many available data processing
> tools like dedupe and the CCDC Cal-Access apps.
> - We will be using a relational data store, likely PostgreSQL.
Technical merits aside, MySQL probably has more presence in the
database world. How strong is the preference here?
> - This will be a new repo under the opencalifornia github repo.
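On keeping a copy of the original source data and updating incrementally,
one possible approach is to store each record untouched next to a content
hash and reprocess only records whose hash has changed. The sketch below
uses a plain dict as a stand-in for the database table, and the "id" and
"amount" fields are hypothetical, not the actual NetFile schema.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Stable SHA-256 digest of a record, independent of key order."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def ingest(raw_store: dict, records, key: str = "id"):
    """Save unmodified records in raw_store; return the new or changed ones."""
    changed = []
    for rec in records:
        digest = record_hash(rec)
        previous = raw_store.get(rec[key])
        if previous is None or previous["hash"] != digest:
            raw_store[rec[key]] = {"hash": digest, "raw": rec}
            changed.append(rec)
    return changed


store = {}
ingest(store, [{"id": 1, "amount": 100}, {"id": 2, "amount": 50}])
# Second pull: only record 2 differs, so only it needs reprocessing.
updates = ingest(store, [{"id": 1, "amount": 100}, {"id": 2, "amount": 75}])
```

Any cleanup or deduplication would then write to separate tables, leaving
the raw copy available for audits.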
> Again, the intent of the hack session is to make some actual progress
> towards this. We have a lot of inertia on the partnerships side
Was that "inertia"? Or "momentum"? :-) We do not need more inertia....
> (props to John and Asha for your hard work on that), but now it's
> time for the programming rubber to meet the road of campaign finance!
> (Okay, so that was a weird way to say it, you get the idea.)
>
> If anyone has questions, or wants to discuss architectural details,
> here is a good place to do that.
> Best,
> Tom
Looking forward to writing some code!
cheers - ray